lnx-search / lnx
⚡ Insanely fast, 🌟 feature-rich searching. lnx is the adaptable, typo-tolerant deployment of the tantivy search engine.
Home Page: https://lnx.rs
License: MIT License
Generally, a pretty big thing would be to add Prometheus metrics, allowing people to use tools like Grafana to track in-flight connections, latencies, etc. This could probably be implemented alongside the telemetry data issue.
At the moment there's a distinct lack of synonym support, and generally I'm not sure how to go about implementing this short of some wildly inefficient system.
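One common (if naive) approach is query-time expansion: rewrite each term into an OR-group of itself plus its synonyms before the query is parsed. A minimal sketch, where the synonym map and function name are hypothetical, not an existing lnx API:

```python
# Query-time synonym expansion: each term is rewritten into an OR-group
# of itself plus its configured synonyms before the query is parsed.
# The synonym map here is a hypothetical user-supplied configuration.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(query: str) -> str:
    parts = []
    for term in query.split():
        alts = [term] + SYNONYMS.get(term.lower(), [])
        if len(alts) > 1:
            parts.append("(" + " OR ".join(alts) + ")")
        else:
            parts.append(term)
    return " ".join(parts)

print(expand_query("classic car film"))
# classic (car OR automobile OR vehicle) (film OR movie)
```

This keeps the index untouched (only queries change), at the cost of larger queries; the alternative, index-time expansion, bloats the index instead.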
For example:
{
"status": 200,
"data": {
"hits": [
{
"doc": {
"author": [
"248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
],
"id": [
18
],
"title": [
"title 01"
],
"uuid": [
"06dbf5c7-d313-413d-8f65-49aed93e4031"
]
},
"document_id": "1628525110829290421",
"score": 1.542423
},
{
"doc": {
"author": [
"248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
],
"id": [
19
],
"title": [
"title 02"
],
"uuid": [
"8da05387-8727-4a27-baa7-265af7558c0c"
]
},
"document_id": "1493516234521670736",
"score": 1.542423
},
{
"doc": {
"author": [
"248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
],
"id": [
20
],
"title": [
"title 03"
],
"uuid": [
"3bf64ee1-f2ac-46ce-8e45-0d25956b195c"
]
},
"document_id": "9603160257558085701",
"score": 1.542423
}
],
"count": 3,
"time_taken": 0.000578893
}
}
I think it would make much more sense to show the doc as it has been posted.
Attempting to perform a search where the order_by field is a date leads to an error:
{"data":"Schema error: 'Field \"ts\" is of type I64!=Date'","status":400}
It looks like this is because the FieldValue is implied to be i64:
lnx/engine/src/index/reader.rs
Lines 549 to 552 in 8d38d38
The docs still target
Download the file via git clone https://github.com/ChillFish8/lnx.git
which is incorrect; this should be changed to the new repo URL.
So in this bug, you are right that was the cause. I copied this example from the book and assumed I was hitting the same issue. What I am actually seeing is that when I index a document with a date field, I am no longer able to index any more documents.
# curl -X DELETE 'http://localhost:4040/indexes/my-index'
{"data":"index deleted","status":200}#
# cat a.json
{
"name": "my-index",
"writer_buffer": 6000000,
"writer_threads": 1,
"reader_threads": 1,
"max_concurrency": 10,
"search_fields": [
"title"
],
"storage_type": "memory",
"use_fast_fuzzy": false,
"strip_stop_words": false,
"set_conjunction_by_default": false,
"fields": {
"title": {
"type": "text",
"stored": true
},
"description": {
"type": "text",
"stored": true
},
"id": {
"type": "u64",
"indexed": true,
"stored": true,
"fast": "single"
},
"ts": {
"type": "date",
"stored": false,
"indexed": true,
"fast": "single"
}
},
"boost_fields": {}
}
# curl -X POST [email protected] -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
# cat c.json
{
"title": ["Hello, World2"],
"id":[4]
}
# curl -X POST [email protected] -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*&mode=normal'
{"data":{"count":1,"hits":[{"doc":{"id":[4],"title":["Hello, World2"]},"document_id":"8295453496340348446","ratio":1.0}],"time_taken":0.0001392010017298162},"status":200}
# cat b.json
{
"title": ["Hello, World2"],
"id":[4],
"ts":[1630097583]
}
# curl -X POST [email protected] -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*&mode=normal'
{"data":{"count":1,"hits":[{"doc":{"id":[4],"title":["Hello, World2"]},"document_id":"8295453496340348446","ratio":1.0}],"time_taken":0.0001936009939527139},"status":200}
Adding a document with a date field doesn't produce an error, but seems to corrupt the index. In my original setup, I always had a date field, and I wasn't seeing any documents get indexed, which is why I assumed the two errors were the same. Once this happens even documents without the ts
field fail to be indexed.
Originally posted by @miyachan in #14 (comment)
I'm not sure if you are just suppressing all log output, but in my testing I found that the log message in handle_msg (lnx/engine/src/index/writer.rs, line 102 in f290cb8) should probably be changed from info! to trace!. I don't find the messages useful enough to be info! either.
You should be able to send gzipped data across to lnx to save network bandwidth and increase the transfer rate.
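The client side of this is cheap; the lnx side would need to honour a Content-Encoding: gzip request header (which is the part to implement). A sketch of how well a repetitive bulk payload compresses:

```python
import gzip
import json

# Client-side sketch: gzip a bulk document payload before upload.
# Honouring Content-Encoding: gzip on the server is the lnx work item.
docs = [{"title": ["Hello, World"], "description": ["Welcome."]}] * 1000
raw = json.dumps(docs).encode("utf-8")
compressed = gzip.compress(raw)

# Highly repetitive JSON compresses extremely well.
print(f"{len(raw)} bytes -> {len(compressed)} bytes")
assert gzip.decompress(compressed) == raw  # round-trips losslessly
```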
Seeing that we can do fuzzy matching and spell correction with the fast fuzzy system, we can produce a set of results that have been corrected with the context of the corpus data. This could potentially be incredibly useful for getting more accurate results, e.g. "th trueman shew" would become "the truman show" according to the movies dataset.
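The shape of this is: snap each query term to its closest word in the corpus vocabulary. A minimal sketch, using a stand-in word list and stdlib string similarity rather than the fast fuzzy system itself:

```python
import difflib

# Corpus-aware correction sketch: snap each query term to its closest
# vocabulary word. In lnx the vocabulary would come from the indexed
# corpus; this tiny word list is a stand-in.
VOCAB = ["the", "truman", "show", "a", "movie"]

def correct(query: str) -> str:
    corrected = []
    for term in query.split():
        match = difflib.get_close_matches(term.lower(), VOCAB, n=1, cutoff=0.5)
        corrected.append(match[0] if match else term)
    return " ".join(corrected)

print(correct("th trueman shew"))
```

A real implementation would use the pre-computed fast-fuzzy dictionary instead of pairwise similarity, but the output contract (original query in, corpus-corrected query out) is the same.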
For some reason, when defining an index and providing given search fields, the query parser isn't being set to use them.
Currently, lnx will create a set of rayon thread pools. This is fine for most cases, but at higher concurrency levels this can start eating up an awfully large amount of CPU at idle, which is fairly wasteful.
The solution would be a dynamically sized pool that grows and shrinks with the load up to a limit, which can help keep the usage down.
By default the tantivy query parser treats multiple terms in the query as OR terms, meaning a query like barack obama will match documents containing only barack or only obama. Sometimes it's desirable to only score documents in which both barack and obama are present in a user-facing search. Tantivy provides this functionality with the set_conjunction_by_default parameter.
I think this could work as an index option or a query time query parameter.
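The behavioural difference can be shown with a toy matcher (illustrative only, not lnx's matching code):

```python
# Toy illustration of disjunction (OR, the tantivy default) versus
# conjunction (AND) semantics for a multi-term query like "barack obama".
docs = {
    1: "barack speech",
    2: "obama speech",
    3: "barack obama speech",
}

def matches(query: str, conjunction: bool):
    terms = query.split()
    hits = []
    for doc_id, text in docs.items():
        words = set(text.split())
        if conjunction:
            ok = all(t in words for t in terms)  # every term required
        else:
            ok = any(t in words for t in terms)  # any term suffices
        if ok:
            hits.append(doc_id)
    return hits

print(matches("barack obama", conjunction=False))  # [1, 2, 3]
print(matches("barack obama", conjunction=True))   # [3]
```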
Using master/0.9 beta, an index with "use_fast_fuzzy": true results in corrupted data.
{
"override_if_exists": true,
"index": {
"name": "products",
"storage_type": "tempdir",
"fields": {
"title": {
"type": "text",
"stored": true
}
},
"search_fields": [],
"boost_fields": {},
"reader_threads": 1,
"max_concurrency": 1,
"writer_buffer": 300000,
"writer_threads": 1,
"set_conjunction_by_default": false,
"use_fast_fuzzy": true,
"strip_stop_words": false,
"auto_commit": 0
}
}
e.g: POST /indexes/products/documents with body
{"title":""}
POST /indexes/products/commit
I get an error message:
{"status":400,"data":"Data corrupted: 'Data corruption: : Failed to open field "title"'s term dictionary in the compos (truncated...)
To recover from this I rebuild the index and add the document with a dummy value, or I rebuild the index without using "use_fast_fuzzy".
The example above is not the real use case; usually titles on all my documents are set, but they have some fields that are optional.
The Documentation that is available at https://docs.lnx.rs/ should be linked from the readme file and the old link to the book should be removed.
Running heavy tests with large datasets, master currently shows signs of a memory leak when running the 50 million Amazon dataset.
0.7.1 does not have this issue and successfully runs at ~9GB max memory usage when indexing the dataset and 4.8GB when complete.
The total data is ~26GB, but RAM usage on master crept up to 62GB before being killed by OOM.
As of right now we use the Tantivy system to persist to disk, which works well. The issue is that we have several of these directories instead of one manager that everything uses; this doesn't work well for maintainability or file management, so it should probably change.
As of right now we cannot scale horizontally / do multi-machine scaling at all. This is a big issue for larger workloads and would really be a good idea to implement. I've been looking at Raft-based setups, which seem to work well, but more research is needed.
Lnx Version: e9804944edc8a7c0af24ee3ba8397b87f1640b5f
I'm trying to somehow reproduce this as I'm not sure how it occurred. I have a system which adds documents to the index and commits every 10s. I was executing searches against the system (specifically I was testing which queries might cause an error, not sure if this is related):
$ curl 'http://localhost:4040/indexes/posts/search?query=text:f^oobar&mode=normal&limit=50&order_by=-ts'
{"data":"Syntax Error","status":400}
$ curl 'http://localhost:4040/indexes/posts/search?query=text:f`oobar&mode=normal&limit=50&order_by=-ts'
{"data":"channel closed","status":400}
logs:
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO - Deleted "8ae9f9e93c674678ae3e7ab694752231.fast"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO - Deleted "31b0bab77e014d539022907d36eac93c.fieldnorm"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO - Deleted "2dbb35c78423479290186b0fccb9b48e.fieldnorm"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO - Deleted "90fcfd004ee34f3892332a95d9c260e1.fast"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210320997 ] completed operation DELETE-TERM
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210320998 ] completed operation ADD-DOCUMENT
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210320999 ] completed operation ADD-DOCUMENT
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::file_watcher | INFO - Meta file "./lnx/index-data/posts/meta.json" was modified
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210321000 ] completed operation DELETE-TERM
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210321001 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210321002 ] completed operation DELETE-TERM
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210321003 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO - [ WRITER @ posts ][ TRANSACTION 210321004 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: thread 'index-posts-worker-0' panicked at 'get executor', /root/lnx/engine/src/index/reader.rs:264:44
Aug 30 00:52:19 torako lnx[307715]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | lnx::routes | WARN - rejecting search index operation due to bad request: channel closed
My index settings are like so:
{
"writer_buffer": 144000000,
"writer_threads": 6,
"reader_threads": 6,
"max_concurrency": 12
}
Is it possible I exceeded the amount of concurrent requests allowed?
There is none, we should add some.
Currently, the system is fairly messy when re-creating schemas. This leads to the issue of #19 because of the conflicting schemas between our loaded schema and tantivy's schema.
Traditionally this doesn't cause any issues for temporary structures, e.g. memory or tempfile, but it can occur with a persistent index, which leads to tantivy having one schema and us having the other. This results in some weird behaviour.
The musl allocator has a bit of a legacy of being slower than most other allocators. We can't use jemalloc due to some compile issues, and it also creates a bit of a desync of performance across operating systems, which I'd like to avoid.
MiMalloc supports both Unix and Windows systems, so it's probably worth testing and seeing how this affects usage and performance.
If it's adequate, it's probably a good idea to use it.
Although I'm personally not a fan of this, it's certainly needed to be able to get a good idea of the areas to focus on, I think.
The data collected only really needs to be the average length of queries, the type of query, and the number of docs (plus index runtime settings), but that's about it. Users should be able to opt out just by passing a flag, e.g. --no-telemetry.
This allows for a lot of de-duplication of combination queries where you might want to apply the same terms to multiple fields.
I think it's also a good idea to add a general * specialisation to allow for searching on all default fields.
I can't seem to get any results from lnx. I'm using commit e9804944edc8a7c0af24ee3ba8397b87f1640b5f. I built lnx using cargo build --release, then started it with /usr/local/bin/lnx -p 4040.
# cat a.json
{
"name": "my-index",
"writer_buffer": 6000000,
"writer_threads": 1,
"reader_threads": 1,
"max_concurrency": 10,
"search_fields": [
"title"
],
"storage_type": "memory",
"use_fast_fuzzy": true,
"strip_stop_words": true,
"fields": {
"title": {
"type": "text",
"stored": true
},
"description": {
"type": "text",
"stored": true
},
"id": {
"type": "u64",
"indexed": true,
"stored": true,
"fast": "single"
}
},
"boost_fields": {
"title": 2,
"description": 0.8
}
}
# curl -X POST [email protected] -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
# cat b.json
{
"title": ["Hello, World"],
"description": ["Welcome to the next generation system."]
}
# curl -X POST -H "Content-Type: application/json" [email protected] http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*'
{"data":{"count":0,"hits":[],"time_taken":0.001682035974226892},"status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=Hello'
{"data":{"count":0,"hits":[],"time_taken":0.00014333099534269422},"status":200}
I can't figure out what I'm doing wrong here.
Hiya!
I'm attempting to run lnx on Kubernetes as a stateful set, but I'm running into an issue: the docs/example code suggest mounting a volume to etc/lnx, but doing this in Kubernetes causes the contents of that path to be replaced with the attached volume, which means the binary can't be found.
Is there a way to parameterise the storage path so that it doesn't collide with the binary path? 🙂
As it stands right now, you can create n indexes which each create, at a bare minimum, 2-3 threads; this isn't great for efficiency either at idle or while indexing.
If you create an index with a 12-thread writer and then another with the same number of threads, then add documents to both, the system will essentially fight itself over the CPU time. This is not massively ideal, slowing both indexing operations down versus queuing the operations one after another.
This not only improves the per-index indexing performance but also makes indexes much cheaper to create, dropping to only 1 thread in a best-case scenario; the #37 issue would basically make this 0, potentially allowing the creation of 'micro' indexes.
At the moment you can do 1 of the 3 (technically 4) query options. While this is fine for simple use, fuzzy searching alone may not be enough to fully filter the relevant docs; a combination setup would allow users to query the given fields via fuzzy search plus the traditional queries like ranges, date lookups, etc.
LNX: e980494
I'm trying to figure out how to reproduce this. After the panic I reported in #18 and restarting lnx, I noticed that the index schema being returned from search queries was messed up. Searching seems to work as advertised (searching field_a:foo will return documents where foo is set in field_a), but the schema of the search results is messed up.
For example, let's say I have a schema where I am only storing 3 fields (field_a, field_b, field_c), but I am indexing 6. Before the restart (example):
$ curl 'http://localhost:4040/indexes/posts/search?query=field_a:foo&mode=normal&limit=50&order_by=-ts'
{"data":{"count":40,"hits":[{"doc":{"field_a":["foo"],"field_b":[4],"field_c":[44]}, # etc
Now that same query is returning:
$ curl 'http://localhost:4040/indexes/posts/search?query=field_a:foo&mode=normal&limit=50&order_by=-ts'
{"data":{"count":40,"hits":[{"doc":{"field_d":["foo"],"field_e":[4],"field_f":[44]}, # etc
The values are correct, but the name of the keys are completely different.
However, in trying to reproduce the error, it seems like my lnx install is corrupted. I tried to create an index like so:
{
"name": "corrupt",
"writer_buffer": 144000000,
"writer_threads": 12,
"reader_threads": 12,
"max_concurrency": 24,
"search_fields": [
"field_a"
],
"storage_type": "filesystem",
"set_conjunction_by_default": true,
"use_fast_fuzzy": false,
"strip_stop_words": false,
"fields": {
"field_a": {
"type": "text",
"stored": true
},
"field_b": {
"type": "u64",
"stored": true,
"indexed": true,
"fast": "single"
},
"field_c": {
"type": "u64",
"stored": true,
"indexed": true,
"fast": "single"
},
"field_d": {
"type": "text",
"stored": false
},
"field_e": {
"type": "text",
"stored": false
},
"field_f": {
"type": "text",
"stored": false
},
"version": {
"type": "u64",
"stored": false,
"indexed": true,
"fast": "single"
}
},
"boost_fields": {}
}
then index these documents:
[
{"field_a":["foo"], "field_b":[4], "field_c":[44], "field_d":["macbook"], "field_e":["apple"], "field_f":["iphone"], "version":[1]},
{"field_a":["bar"], "field_b":[5], "field_c":[55], "field_d":["laptop"], "field_e":["micrsoft"], "field_f":["galaxy"], "version":[2]},
{"field_a":["redbull coke"], "field_b":[6], "field_c":[66], "field_d":["thinkpad"], "field_e":["netflix"], "field_f":["nexus"], "version":[3]},
{"field_a":["vodka sprite"], "field_b":[7], "field_c":[77], "field_d":["ultrabook"], "field_e":["facebook"], "field_f":["blackberry"], "version":[4]},
{"field_a":["ginger ale whiskey"], "field_b":[8], "field_c":[88], "field_d":["chomebook"], "field_e":["google"], "field_f":["oneplus"], "version":[5]}
]
When I added them I got no error:
$ curl -X POST -d@corrupt_data.json -H "Content-Type: application/json" http://localhost:4040/indexes/corrupt/documents?wait=true
{"data":"added documents","status":200}
But in the lnx logs I saw:
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO - [ WRITER @ corrupt ][ TRANSACTION 0 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO - [ WRITER @ corrupt ][ TRANSACTION 1 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO - [ WRITER @ corrupt ][ TRANSACTION 2 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO - [ WRITER @ corrupt ][ TRANSACTION 3 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO - [ WRITER @ corrupt ][ TRANSACTION 4 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index3' panicked at 'Expected a u64/i64/f64 field, got Str("redbull coke") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208:14
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: 0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]: 1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index0' panicked at 'Expected a u64/i64/f64 field, got Str("foo") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208:14
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: 0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]: 1: std::thread 'panickingthrd-tantivy-index1::' panicked at 'begin_panic_fmtExpected a u64/i64/f64 field, got Str("bar")
Aug 30 01:23:56 torako lnx[635823]: ', at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs::208457::145
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: 0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]: 1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]: at thread '/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rsthrd-tantivy-index4:' panicked at '457Expected a u64/i64/f64 field, got Str("vodka sprite") :', 5/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs
Aug 30 01:23:56 torako lnx[635823]: :208: 14
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index2' panicked at 'Expected a u64/i64/f64 field, got Str("ginger ale whiskey") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208: 14
Aug 30 01:23:56 torako lnx[635823]: 0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]: 1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: 0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]: 1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]: at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I'm not sure if these two errors are related, but it seems lnx's understanding of the fields and tantivy's aren't in sync.
As of right now you can only submit data to lnx over HTTP, which makes it quite tedious for large dumped indexes where half your time is spent waiting on disk I/O to even read the data and send it over the network, rather than on lnx processing it. Something similar to Postgres' CSV import system would, I think, be a good idea.
LNX Version: 8d38d38
This is an issue that happened between e980494...8d38d38
$ cat a.json
{
"name": "my-index",
"writer_buffer": 6000000,
"writer_threads": 1,
"reader_threads": 1,
"max_concurrency": 10,
"search_fields": [
"title"
],
"storage_type": "memory",
"use_fast_fuzzy": false,
"strip_stop_words": false,
"set_conjunction_by_default": false,
"fields": {
"title": {
"type": "text",
"stored": true
},
"description": {
"type": "text",
"stored": true
},
"id": {
"type": "u64",
"indexed": true,
"stored": true,
"fast": "single"
},
"ts": {
"type": "date",
"stored": true,
"indexed": true,
"fast": "single"
}
},
"boost_fields": {}
}
$ curl -X POST [email protected] -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
$ cat d.json
{
"id": {"type": "u64", "value": [4]}
}
$ curl -X DELETE [email protected] -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"invalid JSON body: Failed to parse the request body as JSON","status":400}
Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | tantivy::indexer::segment_updater | INFO - Running garbage collection
Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | tantivy::directory::managed_directory | INFO - Garbage collect
Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | engine::index::writer | INFO - [ WRITER @ my-index ][ TRANSACTION 4 ] completed operation COMMIT
Aug 30 19:01:49 torako lnx[866518]: [2021-08-30][19:01:49] | engine::index::reader | INFO - [ SEARCH @ my-index ] took 120.259µs with limit=20, mode=Normal and 1 results total
Aug 30 19:07:26 torako lnx[866518]: [2021-08-30][19:07:26] | lnx::routes | WARN - rejecting request due to invalid body: InvalidJsonBody(Error { inner: Error("invalid type: map, expected a string or u32", line: 1, column: 13) })
Aug 30 19:07:51 torako lnx[866518]: [2021-08-30][19:07:51] | lnx::routes | WARN - rejecting request due to invalid body: InvalidJsonBody(Error { inner: Error("invalid type: map, expected a string or u32", line: 1, column: 13) })
This has been something on my mind for a little while. Overall it would be more beneficial to pass data via a JSON body as opposed to query parameters; this would make it cleaner to construct queries and smoother to add future extensions, e.g. combination queries.
As of right now, the server does not return the original query and the 'processed' query. This could be especially useful for things like fuzzy search where you might want a "Did you mean..." type effect. It also allows the developer to see how the corrections have changed the original query.
At the moment you're stuck with large indexes pending on the connection until the operation is done. This can be fine for most people, but some may not want to wait on the request all that time; some ability to spawn the operation as a background task would be nice.
This is currently a limitation related to the sorting system. We don't support sorting with multi-value fast fields because it starts to get quite overly complicated.
If we do want sorting by multi-value fields then we should probably decide how we want it to behave first.
At the moment you're required to define the writer buffer size and thread count. While this is mostly fine, it can be quite tedious, and for people who don't need all the performance they can get, we can probably just go with a set of sensible defaults instead.
For the thread count, I think a system similar to Tantivy's writer() defaults would work, where it's either n CPU cores or a set max (8), whichever is lower.
For the buffer size it's a bit harder; I think it should be a percentage of the total amount of RAM on the server it's running on, or the minimum, whichever is higher. This allows us to make good use of the RAM available without causing a large load on the server itself.
For example, if we have 16GB of memory available and we allow a budget of 10% of that memory, we have a buffer budget of 1.6GB.
In the case of there not being enough memory, we should cut down the number of threads until the bare minimum is available.
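The proposal above can be sketched as a small calculation; the 10% figure and the per-writer floor below are illustrative values, not decided constants:

```python
# Sketch of the proposed defaults: threads capped like Tantivy's
# writer() (min(cores, 8)) and a buffer of max(10% of RAM, a floor),
# shedding threads when memory is too tight. Values are illustrative.
MIN_BUFFER = 3_000_000  # bytes; illustrative per-thread floor

def default_writer_settings(total_ram: int, cpu_cores: int):
    threads = min(cpu_cores, 8)
    budget = max(int(total_ram * 0.10), MIN_BUFFER)
    # Shed threads until each remaining one gets at least the floor.
    while threads > 1 and budget // threads < MIN_BUFFER:
        threads -= 1
    return threads, budget

threads, budget = default_writer_settings(16 * 1024**3, 12)
print(threads, budget)  # 8 threads, a ~1.6GB buffer budget
```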
This issue is not present in 0.7.0, but is present in 0.8.0.
Error log detail
2022-01-16T14:30:18.000995Z ERROR error during lnx runtime: failed to load existing indexes due to error Failed to acquire Lockfile: LockBusy. Some("Failed to acquire index lock. If you are using a regular directory, this means there is already an `IndexWriter` working on this `Directory`, in this process or in a different process.")
Reproduce step:
Restart or recreate the pod; k8s will send SIGTERM to the pod. This issue is not present in 0.7.0.
In k8s I can add a preStop hook to execute kill -SIGINT $(pgrep lnx) to send CTRL-C to the lnx process, but if lnx has panicked or crashed, how can I unlock the writer index? For example, is there a lock file that could be removed before lnx starts, e.g. rm -rf index/some-lock-file && lnx?
One of things that we likely want for distribution support is the ability to set lnx to be read-only so that the system can freely sync files across nodes.
Using words with special characters with fuzzy search does not give any result.
I put two documents into the index (together with a lot of others): one containing the German word "Fußbodenheizung", which contains the special character 'ß', and another one with a slightly wrong spelling, "Fusbodenheizung".
When the index was created using use_fast_fuzzy, a fuzzy query does not give the expected result:
{
"query": {
"fuzzy": { "ctx": "Fußbodenheizung" }
}
}
-> no hits
Searching without the special character finds one hit instead of two:
{
"query": {
"fuzzy": { "ctx": "Fubodenheizung" }
}
}
-> one hit "Fusbodenheizung"
When the index was created using use_fast_fuzzy=false, the expected behavior is given:
{
"query": {
"fuzzy": { "ctx": "Fußbodenheizung" }
}
}
-> two hits "Fußbodenheizung" and "Fusbodenheizung"
and for the query
{
"query": {
"fuzzy": { "ctx": "Fubodenheizung" }
}
}
-> two hits "Fußbodenheizung" and "Fusbodenheizung"
{
"query": {
"normal": { "ctx": "Fußbodenheizung" }
}
}
-> one hit "Fußbodenheizung"
As of right now, there's no snapshot support for lnx. Although it's not massively difficult to set up a system to take snapshots of the current index state, it might be that you only want to snapshot particular indexes.
A snapshot endpoint should bundle all of the indexes' data and metadata together and compress it.
Then lnx should be able to load from a given snapshot, e.g. decompress and organize the structure.
At the moment we use a general term query, which works quite well but can lead to things like cars 3 coming before cars 2 when you put "cars 2" as the query. This is because the frequency of the words tends to overpower the position of the words; something like a phrase query would work, but that requires additional work to get reasonable relevancy.
When removing a persistent index, tantivy takes a while to fully close up the directory so we can remove it. Because of this we can sometimes run into a permissions error, as we immediately try to recursively delete the folder.
Currently, the only way to select documents based on some fast-field range is via the query parser. While this works, it's not massively dev-friendly if you want to use lnx for user-facing searches with a range.
The addition would add a new range query kind which would follow the same format as the existing term query.
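A possible payload shape, mirroring the query-object style shown elsewhere in these issues; the field name `ts` and the bound keys `gte`/`lt` are assumptions for illustration, not a confirmed lnx API:

```python
import json

# Hypothetical `range` query kind, shaped like the existing query kinds
# (e.g. {"query": {"fuzzy": {...}}}). Bound keys gte/lt are assumptions.
payload = {
    "query": {
        "range": {
            "ts": {"gte": 1630000000, "lt": 1631000000}
        }
    }
}
print(json.dumps(payload))
```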
In German language it is common to combine nouns without whitespace.
e.g.
apple => Apfel
tree => Baum
apple tree => Apfelbaum (no white space between the two words)
Having said that, searching for "Baum" (tree) should also give a hit for the apple tree. If there are documents with "Baum" and "Apfelbaum", then users may expect the document with "Baum" to be ranked higher, but they also expect to find "Apfelbaum" within the results.
In Elasticsearch there is a HyphenationCompoundWordTokenFilter that splits words using a hyphenation ruleset and a word list. The hyphenation ruleset helps avoid splitting words in the wrong way and may speed up the search for words within other words.
Anyway, any simple tokenizer that uses a word list to split the words would help a lot.
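The word-list variant is simple enough to sketch: recursively split a token into dictionary words, so "Apfelbaum" is also indexed under its parts. The word list here is a stand-in, and real filters add hyphenation rules to avoid bad splits:

```python
# Naive compound splitter: recursively decompose a token into words
# from a word list, so "Apfelbaum" also yields "apfel" and "baum".
# Word list is a tiny stand-in for a real German dictionary.
WORDS = {"apfel", "baum", "haus", "tuer"}

def split_compound(token: str):
    """Return the parts if the whole token decomposes, else None."""
    token = token.lower()
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in WORDS:
            rest = split_compound(tail)
            if rest is not None:
                return [head] + rest
    return [token] if token in WORDS else None

print(split_compound("Apfelbaum"))  # ['apfel', 'baum']
```

A tokenizer built on this would emit both the original token and its parts, so "Baum" matches "Apfelbaum" while exact matches still rank higher.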
Regular backups are currently possible, but hard on a per-index basis. A dedicated snapshot system would help this issue and remove the need for a 3rd party tool/script.