
lnx-search / lnx

1.1K stars · 22 watchers · 43 forks · 26.32 MB

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable, typo-tolerant deployment of the tantivy search engine.

Home Page: https://lnx.rs

License: MIT License

Rust 100.00%
search search-engine tantivy tokio rust instant database

lnx's People

Contributors

chillfish8 · offthewidow · oka-tan · onerandomusername · renanvieira · saroh


lnx's Issues

Metrics & stats

Generally, a pretty big thing would be to add Prometheus metrics, allowing people to use things like Grafana to track in-flight connections, latencies, etc... This could probably be implemented alongside the telemetry data issue.
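
A minimal sketch of what this could look like with the prometheus crate (the metric names, the lazy_static usage and the idea of rendering them from a /metrics route are assumptions for illustration, not existing lnx code):

use prometheus::{register_histogram, register_int_gauge, Encoder, Histogram, IntGauge, TextEncoder};

// Hypothetical metrics lnx could track; the names are illustrative only.
lazy_static::lazy_static! {
    static ref IN_FLIGHT: IntGauge =
        register_int_gauge!("lnx_in_flight_requests", "Requests currently being served").unwrap();
    static ref SEARCH_LATENCY: Histogram =
        register_histogram!("lnx_search_latency_seconds", "Search request latency in seconds").unwrap();
}

// Render the default registry in the Prometheus text format, ready to be
// returned from a hypothetical `/metrics` endpoint.
fn render_metrics() -> Vec<u8> {
    let mut buffer = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buffer)
        .expect("encoding metrics should not fail");
    buffer
}

fn main() {
    IN_FLIGHT.inc();
    let timer = SEARCH_LATENCY.start_timer();
    // ... handle the search here ...
    timer.observe_duration();
    IN_FLIGHT.dec();

    println!("{}", String::from_utf8_lossy(&render_metrics()));
}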

Synonym support

At the moment there's a distinct lack of synonym support, and generally I'm not sure how to go about implementing this short of some wildly inefficient system.

Why is every doc field wrapped in an array in search results?

For example:

{
  "status": 200,
  "data": {
    "hits": [
      {
        "doc": {
          "author": [
            "248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
          ],
          "id": [
            18
          ],
          "title": [
            "title 01"
          ],
          "uuid": [
            "06dbf5c7-d313-413d-8f65-49aed93e4031"
          ]
        },
        "document_id": "1628525110829290421",
        "score": 1.542423
      },
      {
        "doc": {
          "author": [
            "248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
          ],
          "id": [
            19
          ],
          "title": [
            "title 02"
          ],
          "uuid": [
            "8da05387-8727-4a27-baa7-265af7558c0c"
          ]
        },
        "document_id": "1493516234521670736",
        "score": 1.542423
      },
      {
        "doc": {
          "author": [
            "248b2e6a-7c36-4da3-bcc4-55a979eb57dc"
          ],
          "id": [
            20
          ],
          "title": [
            "title 03"
          ],
          "uuid": [
            "3bf64ee1-f2ac-46ce-8e45-0d25956b195c"
          ]
        },
        "document_id": "9603160257558085701",
        "score": 1.542423
      }
    ],
    "count": 3,
    "time_taken": 0.000578893
  }
}

I think it would make much more sense to show the doc as it has been posted.
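
For reference, a small sketch of the post-processing being asked for here: collapsing single-element arrays in a hit's doc object back into plain values (illustrative only, using serde_json; this is not how lnx currently serialises hits):

use serde_json::{Map, Value};

// Collapse `"field": [x]` into `"field": x`, leaving genuine
// multi-value fields as arrays.
fn flatten_doc(doc: Map<String, Value>) -> Map<String, Value> {
    doc.into_iter()
        .map(|(field, value)| match value {
            Value::Array(mut items) if items.len() == 1 => (field, items.remove(0)),
            other => (field, other),
        })
        .collect()
}

fn main() {
    let doc: Map<String, Value> =
        serde_json::from_str(r#"{"id":[18],"title":["title 01"]}"#).unwrap();
    // Prints {"id":18,"title":"title 01"}
    println!("{}", Value::Object(flatten_doc(doc)));
}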

Search Results cannot be sorted by a date field

Attempting to perform a search where the order_by field is a date leads to an error:

{"data":"Schema error: 'Field \"ts\" is of type I64!=Date'","status":400}

It looks like this is because the FieldValue is assumed to be i64:

FieldType::Date(_) => {
    let out: (Vec<(i64, DocAddress)>, usize) =
        order_and_search!(searcher, collector, field, &query, executor)?;
    (process_search!(searcher, schema, out.0), out.1)
}

Incorrect git url on docs

The docs still target

Download the file via git clone https://github.com/ChillFish8/lnx.git

which is incorrect; this should be changed to the new repo URL.

Index not updating / adding document

So in this bug, you are right that was the cause. I copied this example from the book and assumed I was hitting the same issue. What I am actually seeing is that when I index a document with a date field, I am no longer able to index any more documents.

# curl -X DELETE  'http://localhost:4040/indexes/my-index'
{"data":"index deleted","status":200}# 
# cat a.json
{
  "name": "my-index",
  "writer_buffer": 6000000,
  "writer_threads": 1,
  "reader_threads": 1,
  "max_concurrency": 10,
  "search_fields": [
    "title"
  ],
  "storage_type": "memory",
  "use_fast_fuzzy": false,
  "strip_stop_words": false,
   "set_conjunction_by_default": false,
  "fields": {
    "title": {
      "type": "text",
      "stored": true
    },
    "description": {
      "type": "text",
      "stored": true
    },
    "id": {
      "type": "u64",
      "indexed": true,
      "stored": true,
      "fast": "single"
    },
 "ts": {
            "type": "date",
            "stored": false,
            "indexed": true,
            "fast": "single"
        }
  },
  "boost_fields": {}
}
# curl -X POST -d@a.json -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
# cat c.json
{
    "title": ["Hello, World2"],
    "id": [4]
}
# curl -X POST -d@c.json -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*&mode=normal'
{"data":{"count":1,"hits":[{"doc":{"id":[4],"title":["Hello, World2"]},"document_id":"8295453496340348446","ratio":1.0}],"time_taken":0.0001392010017298162},"status":200}
# cat b.json
{
    "title": ["Hello, World2"],
    "id": [4],
    "ts": [1630097583]
}
# curl -X POST -d@b.json -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*&mode=normal'
{"data":{"count":1,"hits":[{"doc":{"id":[4],"title":["Hello, World2"]},"document_id":"8295453496340348446","ratio":1.0}],"time_taken":0.0001936009939527139},"status":200}

Adding a document with a date field doesn't produce an error, but seems to corrupt the index. In my original setup, I always had a date field, and I wasn't seeing any documents get indexed, which is why I assumed the two errors were the same. Once this happens, even documents without the ts field fail to be indexed.

Originally posted by @miyachan in #14 (comment)

Suggestion: Downgrade log message in handle_msg to trace

I'm not sure if you are just suppressing all log output, but in my testing I found that the log message in handle_msg is too chatty. It logs every operation that happens on the index, and when you are indexing millions of docs it generates a ton of logs (~100GB of logs in my case!). All that disk writing also severely affects performance: my throughput went from 20k docs/s to 40k docs/s by changing info! to trace!. I don't find the messages useful enough to warrant info! either.
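
The change being suggested is tiny; roughly the following, assuming the log crate macros (the function name and arguments here are illustrative, and the message format is copied from the logs above):

use log::trace;

fn log_operation(index: &str, transaction_id: u64, op: &str) {
    // Previously emitted with `info!`, producing one line per document
    // operation; `trace!` keeps the message available for debugging
    // without flooding the log during bulk indexing.
    trace!(
        "[ WRITER @ {} ][ TRANSACTION {} ] completed operation {}",
        index,
        transaction_id,
        op
    );
}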

Add sentence correction / suggestion endpoint via fast-fuzzy

Seeing that we can do fuzzy matching and spell correction with the fast-fuzzy system, we can produce a set of results that have been corrected with the context of the corpus data. This could potentially be incredibly useful for getting more accurate results, e.g.
"th trueman shew" would become "the truman show" according to the movies dataset.

Move to dynamic executor

Currently, lnx creates a set of rayon thread pools. This is fine for most cases, but at higher concurrency levels it can start eating up an awfully large amount of CPU at idle, which is fairly wasteful.

The solution would be a dynamically sized pool that grows and shrinks with the load up to a limit; this can help keep the idle usage down.
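
One possible direction, sketched below under the assumption that search work can be routed through tokio (which lnx already uses): tokio's spawn_blocking pool grows with load and lets idle threads exit after a keep-alive period, so a semaphore is enough to cap the upper limit. This is only an illustration of the idea, not a planned implementation.

use std::sync::Arc;
use tokio::sync::Semaphore;

// Run a CPU-heavy search job on tokio's dynamically sized blocking pool,
// capping concurrency with a semaphore instead of pinning dedicated threads.
async fn run_search_job<T, F>(limiter: Arc<Semaphore>, job: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // Waiting on a permit costs nothing at idle; no thread is parked on it,
    // unlike a fixed-size dedicated pool.
    let _permit = limiter.acquire_owned().await.expect("semaphore closed");
    tokio::task::spawn_blocking(job)
        .await
        .expect("search job panicked")
}

#[tokio::main]
async fn main() {
    let limiter = Arc::new(Semaphore::new(8)); // upper bound on concurrent jobs
    let hits = run_search_job(limiter.clone(), || 42).await;
    println!("{hits}");
}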

Expose `set_conjunction_by_default` to query API?

By default the tantivy query parser treats multiple terms in the query as OR terms, meaning a query like barack obama will match documents containing only barack or only obama. Sometimes it's desirable to only score documents in which both barack and obama are present in a user-facing search. Tantivy provides this functionality with the set_conjunction_by_default parameter.

I think this could work as an index option or a query-time parameter.
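
For reference, this is the tantivy-level switch in question; a minimal sketch with an illustrative schema:

use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    let mut parser = QueryParser::for_index(&index, vec![body]);
    // With this set, "barack obama" requires both terms to match
    // instead of the default OR behaviour.
    parser.set_conjunction_by_default();
    let _query = parser.parse_query("barack obama")?;
    Ok(())
}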

Data corrupted when using use_fast_fuzzy: "Failed to open field \"title\"'s term dictionary in the compos (truncated...)"

Using master/0.9 beta and an index with "use_fast_fuzzy": true results in corrupted data.

Minimal example

1. create index with use_fast_fuzzy set to true:

{
    "override_if_exists": true,
    "index": {
        "name": "products",
        "storage_type": "tempdir",
        "fields": {
            "title": {
                "type": "text",
                "stored": true
            }
        },
        "search_fields": [],
        "boost_fields": {},
        "reader_threads": 1,
        "max_concurrency": 1,
        "writer_buffer": 300000,
        "writer_threads": 1,
        "set_conjunction_by_default": false,
        "use_fast_fuzzy": true,
        "strip_stop_words": false,
        "auto_commit": 0
    }
} 

2. send a document missing one of the defined fields, or having the value set to an empty string, "-", or "_"

e.g: POST /indexes/products/documents with body
{"title":""}

3.

POST /indexes/products/commit
I get an error message:
{"status":400,"data":"Data corrupted: 'Data corruption: : Failed to open field "title"'s term dictionary in the compos (truncated...)

workaround / recover

To recover from this I rebuild the index and add the document with a dummy value, or I rebuild the index without using "use_fast_fuzzy".

Use case

The example above is not the real use case; usually titles are set on all my documents, but they have some fields that are optional.

master branch memory leak

Running heavy tests with large datasets, master currently shows signs of a memory leak when indexing the 50 million document Amazon dataset.

0.7.1 does not have this issue and successfully runs at ~9GB max memory usage when indexing the dataset and 4.8GB when complete.
The total data is ~26GB, but RAM usage on master crept up to 62GB before being killed by the OOM killer.

Storage system cleanup

As of right now we use the Tantivy system to persist to disk, which works well. The issue is that we have several of these directories instead of one manager that everything uses; this doesn't work well for maintainability or file management, so it should probably change.

Distribution support

As of right now we cannot scale horizontally / do multi-machine scaling at all. This is a big issue for larger workloads and would really be a good idea to implement. I've been looking at Raft-based setups, which seem to work well, but more research is needed.

Invalid queries eventually exhaust the executor ArrayQueue causing a panic

Lnx Version: e9804944edc8a7c0af24ee3ba8397b87f1640b5f

I'm trying to somehow reproduce this as I'm not sure how it occurred. I have a system which adds documents to the index and commits every 10s. I was executing searches against the system (specifically I was testing which queries might cause an error, not sure if this is related):

$ curl 'http://localhost:4040/indexes/posts/search?query=text:f^oobar&mode=normal&limit=50&order_by=-ts'
{"data":"Syntax Error","status":400}
$ curl 'http://localhost:4040/indexes/posts/search?query=text:f`oobar&mode=normal&limit=50&order_by=-ts'
{"data":"channel closed","status":400}

logs:

Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO  - Deleted "8ae9f9e93c674678ae3e7ab694752231.fast"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO  - Deleted "31b0bab77e014d539022907d36eac93c.fieldnorm"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO  - Deleted "2dbb35c78423479290186b0fccb9b48e.fieldnorm"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::managed_directory | INFO  - Deleted "90fcfd004ee34f3892332a95d9c260e1.fast"
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210320997 ] completed operation DELETE-TERM
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210320998 ] completed operation ADD-DOCUMENT
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210320999 ] completed operation ADD-DOCUMENT
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | tantivy::directory::file_watcher | INFO  - Meta file "./lnx/index-data/posts/meta.json" was modified
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210321000 ] completed operation DELETE-TERM
Aug 30 00:52:18 torako lnx[307715]: [2021-08-30][00:52:18] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210321001 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210321002 ] completed operation DELETE-TERM
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210321003 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | engine::index::writer | INFO  - [ WRITER @ posts ][ TRANSACTION 210321004 ] completed operation ADD-DOCUMENT
Aug 30 00:52:19 torako lnx[307715]: thread 'index-posts-worker-0' panicked at 'get executor', /root/lnx/engine/src/index/reader.rs:264:44
Aug 30 00:52:19 torako lnx[307715]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aug 30 00:52:19 torako lnx[307715]: [2021-08-30][00:52:19] | lnx::routes | WARN  - rejecting search index operation due to bad request: channel closed

My index settings are like so:

{
    "writer_buffer": 144000000,
    "writer_threads": 6,
    "reader_threads": 6,

    "max_concurrency": 12
}

Is it possible I exceeded the number of concurrent requests allowed?
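
The panic location ('get executor' in reader.rs) suggests the reader checks an executor out of a pool and, on the error path of an invalid query, never returns it, so repeated bad queries drain the queue. A hedged sketch of one way to make the return automatic, assuming the pool is a crossbeam ArrayQueue (the types are illustrative, not lnx's actual ones):

use std::sync::Arc;
use crossbeam::queue::ArrayQueue;

struct Executor; // stand-in for the pooled search executor

// Checks an executor out of the pool and guarantees it is pushed back when
// dropped, even if the search returns early with an error.
struct ExecutorGuard {
    pool: Arc<ArrayQueue<Executor>>,
    executor: Option<Executor>,
}

impl ExecutorGuard {
    fn checkout(pool: Arc<ArrayQueue<Executor>>) -> Option<Self> {
        let executor = pool.pop()?;
        Some(Self { pool, executor: Some(executor) })
    }
}

impl Drop for ExecutorGuard {
    fn drop(&mut self) {
        if let Some(executor) = self.executor.take() {
            // push only fails if the queue is already full, which cannot
            // happen here because we took this slot out of it.
            let _ = self.pool.push(executor);
        }
    }
}

fn main() {
    let pool = Arc::new(ArrayQueue::new(2));
    let _ = pool.push(Executor);
    let _ = pool.push(Executor);
    {
        let _guard = ExecutorGuard::checkout(pool.clone()).expect("get executor");
        // ... run the search; an early error return no longer leaks the slot ...
    }
    assert_eq!(pool.len(), 2); // executor returned on drop
}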

Refactor index code and persistence code.

Currently, the system is fairly messy when re-creating schemas; this leads to the issue of #19 because of the conflicting schemas between our loaded schema and tantivy's schema.

Traditionally this doesn't cause any issues for temporary structures, e.g. memory or tempfile, but it can occur when we use a persistent index, which leads to tantivy having one schema and us having the other. This leads to some weird behaviour.

Use MiMalloc Allocator

The musl allocator has a bit of a legacy of being slower than most other allocators. We can't use jemalloc due to some compile issues, and it also creates a bit of a desync in performance across operating systems, which I'd like to avoid.

MiMalloc supports both Unix and Windows systems, so it's probably worth testing and seeing how this affects usage and performance.
If it's adequate it's probably a good idea to use it.
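
Switching is mechanical with the mimalloc crate; roughly:

// Cargo.toml: mimalloc = "0.1"
use mimalloc::MiMalloc;

// Route every allocation in the binary through mimalloc instead of the
// platform allocator (musl's, in the case of static builds).
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u64> = (0..1_000).collect(); // allocated via mimalloc
    println!("{}", v.len());
}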

Telemetry data

Although I'm personally not a fan of this, it's certainly needed to be able to get a good idea of the areas to focus on, I think.
The data collected only really needs to be the average length of queries, the type of query and the number of docs (plus index runtime settings), but that's about it. Users should be able to opt out just by passing a flag, e.g. --no-telemetry.

multi_match query kind

This allows for a lot of de-duplication of combination queries where you might want to apply the same terms to multiple fields.

I think it's also a good idea to add a general * specialisation to allow for searching on all default fields.
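
Under the hood this could probably be expressed as a tantivy BooleanQuery with one should-clause per target field; a rough sketch (the multi_match name and the helper are assumptions, not existing lnx code):

use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption, Term};

// Build a query matching `text` in any of the given fields,
// mimicking a hypothetical `multi_match` query kind.
fn multi_match(fields: &[Field], text: &str) -> BooleanQuery {
    let clauses: Vec<(Occur, Box<dyn Query>)> = fields
        .iter()
        .map(|&field| {
            let term = Term::from_field_text(field, text);
            let query = TermQuery::new(term, IndexRecordOption::Basic);
            (Occur::Should, Box::new(query) as Box<dyn Query>)
        })
        .collect();
    BooleanQuery::from(clauses)
}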

Querying returns no results?

I can't seem to get any results from lnx. I'm using commit e9804944edc8a7c0af24ee3ba8397b87f1640b5f. I built lnx using cargo build --release, then started it with /usr/local/bin/lnx -p 4040.

# cat a.json
{
  "name": "my-index",
  "writer_buffer": 6000000,
  "writer_threads": 1,
  "reader_threads": 1,
  "max_concurrency": 10,
  "search_fields": [
    "title"
  ],
  "storage_type": "memory",
  "use_fast_fuzzy": true,
  "strip_stop_words": true,
  "fields": {
    "title": {
      "type": "text",
      "stored": true
    },
    "description": {
      "type": "text",
      "stored": true
    },
    "id": {
      "type": "u64",
      "indexed": true,
      "stored": true,
      "fast": "single"
    }
  },
  "boost_fields": {
    "title": 2,
    "description": 0.8
  }
}
# curl -X POST -d@a.json -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
# cat b.json
{
    "title": ["Hello, World"],
    "description": ["Welcome to the next generation system."]
}
# curl -X POST -H "Content-Type: application/json" -d@b.json http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"added documents","status":200}
# curl -X POST 'http://localhost:4040/indexes/my-index/commit'
{"data":"changes committed","status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=*'
{"data":{"count":0,"hits":[],"time_taken":0.001682035974226892},"status":200}
# curl 'http://localhost:4040/indexes/my-index/search?query=Hello'
{"data":{"count":0,"hits":[],"time_taken":0.00014333099534269422},"status":200}

I can't figure out what I'm doing wrong here.

Mounting volume overwrites binary

Hiya!

I’m attempting to run LNX on Kubernetes as a stateful set, but running into an issue - the docs/example code suggests mounting a volume to etc/lnx but doing this in Kubernetes causes the contents of that path to be replaced with the attached volume, which means the binary can’t be found.

Is there a way to parameterise the storage path so that it doesn't collide with the binary path? 🙂

Have one writer actor per engine that manages all index writers.

As it stands right now, you can create n indexes which each create, at a bare minimum, 2-3 threads; this isn't great for efficiency either at idle or while indexing.

If you create an index with a 12-thread writer and then another with the same number of threads, then add documents to both, the system will essentially fight itself over CPU time. This is not ideal, slowing both indexing operations down versus queuing the operations one after another.

This not only improves per-index indexing performance but also makes indexes much cheaper to create, dropping to only 1 thread in the best-case scenario; the #37 issue would basically make this 0, potentially allowing the creation of 'micro' indexes.

Combination queries

At the moment you can use 1 of the 3 (technically 4) query options. While this is fine for simple use, fuzzy searching alone may not be enough to fully filter the relevant docs; a combination setup would allow users to query the given fields via fuzzy search together with the traditional queries like ranges, date lookups, etc...

Schema metadata corrupted after restart

LNX: e980494

I'm trying to figure out how to reproduce this. After the panic I reported in #18 and restarting lnx, I noticed that the index schema being returned from search queries was messed up. Searching seems to work as advertised (searching field_a:foo will return documents where foo is set in field_a), but the schema of the search results is messed up.

For example, let's say I have a schema where I am only storing 3 fields (field_a, field_b, field_c), but I am indexing 6. Before the restart (example):

$ curl 'http://localhost:4040/indexes/posts/search?query=field_a:foo&mode=normal&limit=50&order_by=-ts'
{"data":{"count":40,"hits":[{"doc":{"field_a":["foo"],"field_b":[4],"field_c":[44]}, # etc

Now that same query is returning:

$ curl 'http://localhost:4040/indexes/posts/search?query=field_a:foo&mode=normal&limit=50&order_by=-ts'
{"data":{"count":40,"hits":[{"doc":{"field_d":["foo"],"field_e":[4],"field_f":[44]}, # etc

The values are correct, but the names of the keys are completely different.

However, in trying to reproduce the error, it seems like my lnx install is corrupted. I tried to create an index like so:

{
    "name": "corrupt",

    "writer_buffer": 144000000,
    "writer_threads": 12,
    "reader_threads": 12,

    "max_concurrency": 24,
    "search_fields": [
        "field_a"
    ],

    "storage_type": "filesystem",
    "set_conjunction_by_default": true,
    "use_fast_fuzzy": false,
    "strip_stop_words": false,

    "fields": {
        "field_a": {
            "type": "text",
            "stored": true
        },
        "field_b": {
           "type": "u64",
           "stored": true,
           "indexed": true,
           "fast": "single"
        },
        "field_c": {
           "type": "u64",
           "stored": true,
           "indexed": true,
           "fast": "single"
        },
        "field_d": {
            "type": "text",
            "stored": false
        },
        "field_e": {
            "type": "text",
            "stored": false
        },
        "field_f": {
            "type": "text",
            "stored": false
        },
        "version": {
            "type": "u64",
            "stored": false,
            "indexed": true,
            "fast": "single"
        }
    },
    "boost_fields": {}
}

then index these documents:

[
    {"field_a":["foo"], "field_b":[4], "field_c":[44], "field_d":["macbook"], "field_e":["apple"], "field_f":["iphone"], "version":[1]},
    {"field_a":["bar"], "field_b":[5], "field_c":[55], "field_d":["laptop"], "field_e":["micrsoft"], "field_f":["galaxy"], "version":[2]},
    {"field_a":["redbull coke"], "field_b":[6], "field_c":[66], "field_d":["thinkpad"], "field_e":["netflix"], "field_f":["nexus"], "version":[3]},
    {"field_a":["vodka sprite"], "field_b":[7], "field_c":[77], "field_d":["ultrabook"], "field_e":["facebook"], "field_f":["blackberry"], "version":[4]},
    {"field_a":["ginger ale whiskey"], "field_b":[8], "field_c":[88], "field_d":["chomebook"], "field_e":["google"], "field_f":["oneplus"], "version":[5]}
]

When I added them I got no error:

 $ curl -X POST -d@corrupt_data.json -H "Content-Type: application/json" http://localhost:4040/indexes/corrupt/documents?wait=true
{"data":"added documents","status":200}

But in the lnx logs I saw

Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO  - [ WRITER @ corrupt ][ TRANSACTION 0 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO  - [ WRITER @ corrupt ][ TRANSACTION 1 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO  - [ WRITER @ corrupt ][ TRANSACTION 2 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO  - [ WRITER @ corrupt ][ TRANSACTION 3 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: [2021-08-30][01:23:56] | engine::index::writer | INFO  - [ WRITER @ corrupt ][ TRANSACTION 4 ] completed operation ADD-DOCUMENT
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index3' panicked at 'Expected a u64/i64/f64 field, got Str("redbull coke") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208:14
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]:    0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]:    1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]:    2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index0' panicked at 'Expected a u64/i64/f64 field, got Str("foo") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208:14
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]:    0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]:    1: std::thread 'panickingthrd-tantivy-index1::' panicked at 'begin_panic_fmtExpected a u64/i64/f64 field, got Str("bar")
Aug 30 01:23:56 torako lnx[635823]: ',              at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs::208457::145
Aug 30 01:23:56 torako lnx[635823]:    2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]:    0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]:    1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]:              at thread '/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rsthrd-tantivy-index4:' panicked at '457Expected a u64/i64/f64 field, got Str("vodka sprite") :', 5/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs
Aug 30 01:23:56 torako lnx[635823]: :208: 14
Aug 30 01:23:56 torako lnx[635823]: 2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]: thread 'thrd-tantivy-index2' panicked at 'Expected a u64/i64/f64 field, got Str("ginger ale whiskey") ', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.16.0/src/fastfield/mod.rs:208: 14
Aug 30 01:23:56 torako lnx[635823]:  0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]:    1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]:    2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 30 01:23:56 torako lnx[635823]: stack backtrace:
Aug 30 01:23:56 torako lnx[635823]:    0: rust_begin_unwind
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:515:5
Aug 30 01:23:56 torako lnx[635823]:    1: std::panicking::begin_panic_fmt
Aug 30 01:23:56 torako lnx[635823]:              at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/std/src/panicking.rs:457:5
Aug 30 01:23:56 torako lnx[635823]:    2: tantivy::indexer::index_writer::index_documents
Aug 30 01:23:56 torako lnx[635823]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I'm not sure if these two errors are related, but it seems lnx's understanding of the fields and tantivy's aren't in sync.

Import data from local file directly.

As of right now you can only submit data to lnx over HTTP, which makes it quite tedious for large dumped indexes, where half your time is spent waiting on disk IO to read the data and send it over the network rather than on lnx processing it. Similar to Postgres' CSV import system, I think supporting direct imports from a local file would be a good idea.

Delete Endpoint always fails

LNX Version: 8d38d38

This is an issue that happened between e980494...8d38d38

$ cat a.json
{
  "name": "my-index",
  "writer_buffer": 6000000,
  "writer_threads": 1,
  "reader_threads": 1,
  "max_concurrency": 10,
  "search_fields": [
    "title"
  ],
  "storage_type": "memory",
  "use_fast_fuzzy": false,
  "strip_stop_words": false,
   "set_conjunction_by_default": false,
  "fields": {
    "title": {
      "type": "text",
      "stored": true
    },
    "description": {
      "type": "text",
      "stored": true
    },
    "id": {
      "type": "u64",
      "indexed": true,
      "stored": true,
      "fast": "single"
    },
 "ts": {
            "type": "date",
            "stored": true,
            "indexed": true,
            "fast": "single"
        }
  },
  "boost_fields": {}
}
$ curl -X POST -d@a.json -H "Content-Type: application/json" http://127.0.0.1:4040/indexes
{"data":"index created","status":200}
$ cat d.json
{
    "id": {"type": "u64", "value": [4]}
}
$ curl -X DELETE -d@d.json -H "Content-Type: application/json" http://localhost:4040/indexes/my-index/documents?wait=true
{"data":"invalid JSON body: Failed to parse the request body as JSON","status":400}

Logs:

Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | tantivy::indexer::segment_updater | INFO  - Running garbage collection
Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | tantivy::directory::managed_directory | INFO  - Garbage collect
Aug 30 19:01:42 torako lnx[866518]: [2021-08-30][19:01:42] | engine::index::writer | INFO  - [ WRITER @ my-index ][ TRANSACTION 4 ] completed operation COMMIT
Aug 30 19:01:49 torako lnx[866518]: [2021-08-30][19:01:49] | engine::index::reader | INFO  - [ SEARCH @ my-index ] took 120.259µs with limit=20, mode=Normal and 1 results total
Aug 30 19:07:26 torako lnx[866518]: [2021-08-30][19:07:26] | lnx::routes | WARN  - rejecting request due to invalid body: InvalidJsonBody(Error { inner: Error("invalid type: map, expected a string or u32", line: 1, column: 13) })
Aug 30 19:07:51 torako lnx[866518]: [2021-08-30][19:07:51] | lnx::routes | WARN  - rejecting request due to invalid body: InvalidJsonBody(Error { inner: Error("invalid type: map, expected a string or u32", line: 1, column: 13) })

Move to POST request querying

This has been something on my mind for a little while. Overall it would be more beneficial to pass data via a JSON body as opposed to query parameters; this would make it cleaner to construct queries and smoother to add future extensions, e.g. combination queries.
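
A hedged sketch of what the request body could deserialise into on the server side, using serde; every field name here is an assumption about a possible future API, not something lnx exposes today:

use serde::Deserialize;

// Hypothetical JSON body for a future `POST /indexes/:index/search`.
#[derive(Debug, Deserialize)]
struct SearchPayload {
    query: String,
    mode: Option<String>,
    limit: Option<usize>,
    order_by: Option<String>,
}

fn main() {
    let body = r#"{ "query": "barack obama", "limit": 50, "order_by": "-ts" }"#;
    let payload: SearchPayload = serde_json::from_str(body).unwrap();
    println!("{payload:?}");
}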

Return original query and corrected query as part of results.

As of right now, the server does not return the original query or the 'processed' query. This could be especially useful for things like fuzzy search where you might want a "Did you mean..." type effect. It also allows the developer to see how the corrections have changed the original query.

Async writes, updates and transactions.

At the moment you're stuck with large index operations pending on the connection until the operation is done. This can be fine for most people, but some may not want to wait on the request all that time; some ability to spawn the operation as a background task would be nice.

Multi-value field sorting.

This is currently a limitation related to the sorting system. We don't support sorting with multi-value fast fields because it starts to get quite overly complicated.

If we do want sorting by multi-value fields then we should probably decide how we want it to behave first.

Add sensible writer configuration defaults.

At the moment you're required to define the writer buffer size and thread count. While this is mostly fine, it can be quite tedious, and for people who don't need all the performance they can get, we can probably just go with a set of sensible defaults instead.

For the thread count, I think a system similar to tantivy's writer() defaults, where it's either the number of CPU cores or a set max (8), whichever is lower.

For the buffer size it's a bit harder. I think it should be a percentage of the total amount of RAM on the server it's running on, or the minimum, whichever is higher; this allows us to make good use of the RAM available without causing a large load on the server itself.

For example, if we have 16GB of memory available and we allow a budget of 10% of that memory, we have a buffer budget of 1.6GB.

In the case of there not being enough memory, we should cut down the number of threads until the bare minimum is available.
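
A small sketch of the arithmetic described above; the 10% budget and 8-thread cap come from this issue, while the per-thread minimum and the function itself are illustrative assumptions:

// Illustrative defaults: threads = min(cpu cores, 8),
// buffer = max(10% of system RAM, a per-thread minimum).
fn default_writer_settings(total_ram_bytes: u64, cpu_cores: usize) -> (usize, u64) {
    const MAX_THREADS: usize = 8;
    const MIN_BUFFER_PER_THREAD: u64 = 32 * 1024 * 1024; // assumed floor per writer thread

    let mut threads = cpu_cores.min(MAX_THREADS);
    let budget = total_ram_bytes / 10; // 10% of system memory

    // If the budget can't cover every thread's minimum buffer,
    // shed threads until it can (never below one).
    while threads > 1 && budget < MIN_BUFFER_PER_THREAD * threads as u64 {
        threads -= 1;
    }
    let buffer = budget.max(MIN_BUFFER_PER_THREAD * threads as u64);
    (threads, buffer)
}

fn main() {
    // 16GB of RAM and 12 cores -> 8 threads and a ~1.6GB writer buffer.
    let (threads, buffer) = default_writer_settings(16 * 1024 * 1024 * 1024, 12);
    println!("threads = {threads}, buffer = {buffer} bytes");
}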

[BUG][0.8.0] index writer lock issue during k8s pod restart

This issue is not present in 0.7.0, but present in 0.8.0

Error log detail

2022-01-16T14:30:18.000995Z ERROR error during lnx runtime: failed to load existing indexes due to error Failed to acquire Lockfile: LockBusy. Some("Failed to acquire index lock. If you are using a regular directory, this means there is already an `IndexWriter` working on this `Directory`, in this process or in a different process.")

Reproduce step:
Restart or recreate the pod. k8s will send SIGTERM to the pod. This issue is not present in 0.7.0.
In k8s I can add a preStop hook to execute kill -SIGINT $(pgrep lnx) to send CTRL-C to the lnx process, but if lnx has panicked or crashed, how can I unlock the writer index? For example, is there a lock file that could be removed before lnx starts, e.g. rm -rf index/some-lock-file && lnx?

Read-only node configuration

One of the things that we likely want for distribution support is the ability to set lnx to be read-only so that the system can freely sync files across nodes.

No Result - Search for unicode character in fuzzy search when using use_fast_fuzzy

Using words with special characters with fuzzy search does not give any result.

how to reproduce

I put two documents into the index (together with a lot of others): one containing the German word "Fußbodenheizung", which contains the special character 'ß', and another one with a slightly wrong spelling, "Fusbodenheizung".

When the index was created using use_fast_fuzzy, a fuzzy query does not give the expected result:

{
  "query": {
    "fuzzy": { "ctx": "Fußbodenheizung" }
  }
}

-> no hits

searching without the special character finds one hit instead of two:

{
  "query": {
    "fuzzy": { "ctx": "Fubodenheizung" }
  }
}

-> one hit "Fusbodenheizung"

expected

When the index was created with use_fast_fuzzy=false, the expected behavior is given:

{
  "query": {
    "fuzzy": { "ctx": "Fußbodenheizung" }
  }
}

-> two hits "Fußbodenheizung" and "Fusbodenheizung"

and for the query

{
  "query": {
    "fuzzy": { "ctx": "Fubodenheizung" }
  }
}

-> two hits "Fußbodenheizung" and "Fusbodenheizung"

A normal query finds one hit as expected:

{
  "query": {
    "normal": { "ctx": "Fußbodenheizung" }
  }
}

-> one hit "Fußbodenheizung"

snapshot support

As of right now, there's no snapshot support for lnx. Although it's not massively difficult to set up a system to take snapshots of the current index state, it might be that you only want to snapshot particular indexes.

A snapshot endpoint should bundle all of the indexes' data and metadata together and compress it.
Then lnx should be able to load from a given snapshot e.g. decompress and organize the structure.
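
An illustrative sketch of the bundling step using the tar and flate2 crates (the crate choice and paths are assumptions; no such endpoint exists yet):

use std::fs::File;
use std::path::Path;

use flate2::write::GzEncoder;
use flate2::Compression;

// Bundle an index directory (data + metadata) into a single
// compressed .tar.gz snapshot file.
fn snapshot_index(index_dir: &Path, output: &Path) -> std::io::Result<()> {
    let file = File::create(output)?;
    let encoder = GzEncoder::new(file, Compression::default());
    let mut archive = tar::Builder::new(encoder);

    // Store everything under a stable top-level name so restoring
    // can unpack into any target directory.
    archive.append_dir_all("index", index_dir)?;
    archive.into_inner()?.finish()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    snapshot_index(
        Path::new("./lnx/index-data/posts"),
        Path::new("posts-snapshot.tar.gz"),
    )
}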

Custom query system for better relevancy when using fast fuzzy.

At the moment we use a general term query, which works quite well but can lead to things like cars 3 coming before cars 2 when you put "cars 2" as the query. This is because the frequency of the words tends to overpower the position of the words. Something like an additional phrase query would work, but that requires extra work to get reasonable relevancy.

Index remove permission error

When removing a persistent index, tantivy takes a while to fully close up the directory so we can remove it. Because of this we can sometimes run into a permissions error, as we immediately try to recursively delete the folder.
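
A simple mitigation would be to retry the recursive delete with a short backoff rather than failing on the first permissions error; a sketch (the delay, attempt count and path are arbitrary):

use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

// Retry removing the index directory a few times, since tantivy may still be
// releasing file handles when the delete is first attempted.
fn remove_index_dir(path: &Path, attempts: u32) -> std::io::Result<()> {
    let mut last_err = None;
    for _ in 0..attempts {
        match std::fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                sleep(Duration::from_millis(250));
            }
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}

fn main() {
    let _ = remove_index_dir(Path::new("./lnx/index-data/old-index"), 10);
}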

Add range queries.

Currently, the only way to select documents based on some fast-field range is via the query parser, which, while it works, is not massively dev-friendly if you want to use lnx for user-facing searches with a range.

The addition would add a new range query kind which would follow the same format as the existing term query.
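
For reference, tantivy already exposes the building block this new kind would likely map to; a rough sketch with an illustrative field (the exact RangeQuery constructor varies between tantivy versions):

use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, INDEXED};

fn main() {
    let mut schema_builder = Schema::builder();
    let price = schema_builder.add_u64_field("price", INDEXED);
    let _schema = schema_builder.build();

    // Select documents whose `price` field falls in [10, 100).
    let _query = RangeQuery::new_u64(price, 10..100);
}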

Decompound Words For German Language

In the German language it is common to combine nouns without whitespace, e.g.

apple => Apfel
tree => Baum
apple tree => Apfelbaum (no white space between the two words)

Having said that, searching for "Baum" (tree) should also give a hit for the apple tree. If there are documents with "Baum" and "Apfelbaum", then users may expect the document with "Baum" to be ranked higher, but they also expect to find "Apfelbaum" within the results.

In Elasticsearch there is a HyphenationCompoundWordTokenFilter that splits words by using a hyphenation ruleset and a word list. The hyphenation ruleset helps to avoid splitting words in the wrong way and may speed up the search for words within other words.

Anyway, any simple tokenizer that uses a word list to split the words would help a lot.
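
A naive sketch of the word-list approach, greedily splitting a compound into known dictionary words; the dictionary, the minimum part length and the function itself are placeholders, and a real solution would plug into the tokenizer pipeline instead:

use std::collections::HashSet;

// Greedy longest-match decompounding: split `word` into parts that all appear
// in `dictionary`, or return it unchanged if no full split exists.
fn decompound(word: &str, dictionary: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.to_lowercase().chars().collect();
    let mut parts = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        // Try the longest possible remaining prefix first.
        let mut end = chars.len();
        let mut matched = false;
        while end > start {
            let candidate: String = chars[start..end].iter().collect();
            // Require at least 3 characters per part to avoid degenerate splits.
            if end - start >= 3 && dictionary.contains(candidate.as_str()) {
                parts.push(candidate);
                start = end;
                matched = true;
                break;
            }
            end -= 1;
        }
        if !matched {
            return vec![word.to_string()]; // give up and keep the original token
        }
    }
    parts
}

fn main() {
    let dictionary: HashSet<&str> = ["apfel", "baum"].into_iter().collect();
    // "Apfelbaum" -> ["apfel", "baum"], so a search for "Baum" can match it.
    println!("{:?}", decompound("Apfelbaum", &dictionary));
}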

Snapshot Support

Regular backups are currently possible, but hard on a per-index basis. A dedicated snapshot system would help this issue and remove the need for a 3rd party tool/script.
