zhihu / rucene
Rust port of Lucene
License: Apache License 2.0
Hello fellow Rustacean,
we (Rust group @sslab-gatech) found a memory-safety/soundness issue in this crate while scanning Rust code on crates.io for potential vulnerabilities.
rucene/src/core/store/io/data_input.rs
Lines 214 to 220 in 5b55f84
The core::store::io::data_input::DataInput::read_string() method creates an uninitialized buffer and passes it to a user-provided Read implementation. This is unsound, because it allows safe Rust code to exhibit undefined behavior (reading from uninitialized memory).
This part of the Read trait documentation explains the issue:
"It is your responsibility to make sure that buf is initialized before calling read. Calling read with an uninitialized buf (of the kind one obtains via MaybeUninit<T>) is not safe, and can lead to undefined behavior."
The naive and safe way to fix the issue is to always zero-initialize a buffer before lending it to a user-provided Read implementation. Note that this approach adds the runtime overhead of zero-initializing the buffer.
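As a minimal sketch of the zero-initializing approach, the helper below reads a fixed number of bytes into a zeroed buffer and decodes them as UTF-8. The name read_string_zeroed and its signature are hypothetical and are not rucene's API; the real read_string presumably obtains the length from the input itself.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of rucene): reads exactly `len` bytes from
/// `reader` and decodes them as UTF-8.
fn read_string_zeroed<R: Read>(reader: &mut R, len: usize) -> io::Result<String> {
    // Zero-initialize the buffer so that the `Read` implementation can never
    // observe uninitialized memory, even if it reads from `buf`.
    let mut buf = vec![0u8; len];
    // `read_exact` returns an error if fewer than `len` bytes are available.
    reader.read_exact(&mut buf)?;
    // Reject invalid UTF-8 instead of assuming the bytes are well formed.
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The cost is one extra pass over the buffer to zero it, which is the overhead mentioned above.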
As of Feb 2021, there is not yet an ideal fix that works with no performance overhead. Below are links to relevant discussions & suggestions for the fix.
https://github.com/tantivy-search/tantivy is also a library that ports Lucene to Rust. Have you compared the features of this project and tantivy?
I am just building the example code and getting the error below:
error[E0599]: no method named `get_ref` found for union `MaybeUninit` in the current scope
--> /home/oz-mint/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.1/src/core/search/query/spans/span_near.rs:509:45
|
509 | } else if self.conjunction_span.get_ref().one_exhausted_in_current_doc {
| ^^^^^^^ method not found in `MaybeUninit<ConjunctionSpanBase<P>>`
|
::: /home/oz-mint/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/pin.rs:804:18
|
804 | pub const fn get_ref(self) -> &'a T {
| ------- the method is available for `Pin<&MaybeUninit<span::ConjunctionSpanBase<P>>>` here
|
help: consider wrapping the receiver expression with the appropriate type
|
509 | } else if Pin::new(&self.conjunction_span).get_ref().one_exhausted_in_current_doc {
| ++++++++++ +
Some errors have detailed explanations: E0308, E0554, E0599, E0635.
For more information about an error, try `rustc --explain E0308`.
error: could not compile `rucene` due to 67 previous errors
Any suggestions?
The project is here: https://github.com/ozkanpakdil/rust-examples/tree/main/rucene_test
Context: I am adding rucene to https://github.com/tantivy-search/search-benchmark-game.
It is a search benchmark comparing Lucene, Tantivy, Bleve and now Rucene.
Indexing works but I have to periodically commit to avoid getting a panic.
See the following two lines of code and comment.
https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/build_index.rs#L103-L104
(I suspect a u32 overflow.)
The search benchmark consists of indexing all docs in the English Wikipedia.
To level the playing field, we merge all segments down to a single segment.
I was happy to see that rucene also implements force_merge with the blocking option.
Unfortunately, after the merge finishes, I end up with an index of 24 GB.
(Tantivy and Lucene both end up with an index of 3 GB.)
Failing to build on the latest nightly, but also when using rustup run nightly-2019-10-28 cargo build:
error[E0658]: `cfg(doctest)` is experimental and subject to change
--> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:74:7
|
74 | #[cfg(doctest)]
| ^^^^^^^
|
= note: for more information, see https://github.com/rust-lang/rust/issues/62210
= help: add `#![feature(cfg_doctest)]` to the crate attributes to enable
error[E0658]: `cfg(doctest)` is experimental and subject to change
--> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:77:7
|
77 | #[cfg(doctest)]
| ^^^^^^^
|
= note: for more information, see https://github.com/rust-lang/rust/issues/62210
= help: add `#![feature(cfg_doctest)]` to the crate attributes to enable
error: aborting due to 2 previous errors
For more information about this error, try `rustc --explain E0658`.
error: could not compile `memoffset`.
warning: build failed, waiting for other jobs to finish...
error: build failed
Looks correct to me?
rustc --version
rustc 1.40.0-nightly (95f437b3c 2019-10-27)
I built an index with the master branch and, after reading the segments file, found that the version is 6.4.18. Which Lucene version is this compatible with?
>> read(segments_1)
header length: 35
lucene version: 6.4.18
version: 4
nameCounter: 1
segCount: 1
...
Is it fully compatible with native Lucene?
Phrase query fails with a panic when running on 10_000 Wikipedia docs.
See the following commented-out code.
https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/do_query.rs#L126-L128
The difference is:
If it is a whole engine, we may need to deploy multiple instances, perhaps with Docker or something similar, to ensure HA and distribution.
If it is just a library for an engine (like Lucene relative to Elasticsearch/Solr), is there a plan to open-source the whole engine as well?
The following code prints an arbitrary number, because the vec has already been dropped. I can also get aliasing &mut's to the same value by simply calling p.longs() multiple times, since it takes &self.
A fix to this would be to store &mut Vec and have a lifetime parameter inside LongsPtr, and have .longs() take &mut self.
fn main() {
    let p = rucene::core::util::LongsPtr::new(&mut vec![15], 0, 0);
    dbg!(p.longs()[0]);
}
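The suggested fix could look roughly like the sketch below. This LongsPtr is a hypothetical stand-in, not rucene's type: the i64 element type and the single offset argument are assumptions (the real constructor takes two integers whose meaning I am not sure of).

```rust
/// Hypothetical stand-in for `rucene::core::util::LongsPtr`.
/// The element type and the `offset` field are assumptions for illustration.
struct LongsPtr<'a> {
    // Borrowing the Vec ties the pointer's lifetime to the Vec's lifetime.
    longs: &'a mut Vec<i64>,
    offset: usize,
}

impl<'a> LongsPtr<'a> {
    fn new(longs: &'a mut Vec<i64>, offset: usize) -> Self {
        LongsPtr { longs, offset }
    }

    /// Taking `&mut self` means only one mutable slice can be live at a time,
    /// so repeated calls cannot produce aliasing `&mut` references.
    fn longs(&mut self) -> &mut [i64] {
        &mut self.longs[self.offset..]
    }
}
```

With this shape, binding LongsPtr::new(&mut vec![15], 0) to a variable and using it on a later line is rejected by the borrow checker, because the temporary Vec is dropped at the end of the let statement while still borrowed.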
Rucene also fails to build with the stable channel (due to the use of #[feature]), so I tried the nightly release:
$ cargo build
Compiling rucene v0.1.0
error[E0432]: unresolved import `std::boxed::FnBox`
--> /Users/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.0/src/core/util/thread_pool.rs:27:5
|
27 | use std::boxed::FnBox;
| ^^^^^^^^^^^^^^^^^ no `FnBox` in `boxed`
$ rustc --version
rustc 1.41.0-nightly (59947fcae 2019-12-08)
It'd be great if we can work with the latest Lucene
Lucene has a lucene-backward-codecs library.
In trying to run a Lucene 8 shard, I hit:
Error: Error(CorruptIndex("index format either too new or too old: 4 <= 9 <= 6 doesn\'t hold"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })
Any plans to expand the supported codecs? Could you document which codecs are currently supported? Based on the error above, I assume format versions between 4 and 6.
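For readers unfamiliar with the message, my reading is that it comes from a range check of the form min <= actual <= max. The sketch below reproduces that check; check_format_version is a hypothetical function written for illustration, not rucene's code, and it assumes the three numbers in the message are printed in that order.

```rust
/// Hypothetical range check (not rucene's implementation): accepts `actual`
/// only if it lies within the inclusive range `min..=max`.
fn check_format_version(actual: i32, min: i32, max: i32) -> Result<i32, String> {
    if actual < min || actual > max {
        // Same wording as the error above: min <= actual <= max does not hold.
        return Err(format!(
            "index format either too new or too old: {} <= {} <= {} doesn't hold",
            min, actual, max
        ));
    }
    Ok(actual)
}
```

Under that reading, the Lucene 8 shard reports format version 9, while the reader accepts only 4 through 6.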
This might be a silly question: does Rucene support indexing and searching Chinese characters?
I don't see any tokenizer for it under https://github.com/zhihu/rucene/tree/master/src/core/analysis