zhihu / rucene

Rust port of Lucene

License: Apache License 2.0

Rust 100.00%

lucene rust information-retrival


rucene's Issues

`Read` on uninitialized buffer may cause UB

Hello fellow Rustacean,
we (Rust group @sslab-gatech) found a memory-safety/soundness issue in this crate while scanning Rust code on crates.io for potential vulnerabilities.

Issue Description

rucene/src/core/store/io/data_input.rs

Lines 214 to 220 in 5b55f84

    
    let mut buffer = Vec::with_capacity(length);
    unsafe {
        buffer.set_len(length);
    };
    self.read_exact(&mut buffer)?;

The core::store::io::data_input::DataInput::read_string() method creates an uninitialized buffer and passes it to a user-provided Read implementation. This is unsound because it allows safe Rust code to exhibit undefined behavior (reading from uninitialized memory).

This part from the Read trait documentation explains the issue:

It is your responsibility to make sure that buf is initialized before calling read. Calling read with an uninitialized buf (of the kind one obtains via MaybeUninit<T>) is not safe, and can lead to undefined behavior.

How to fix the issue?

The naive and safe way to fix the issue is to always zero-initialize the buffer before lending it to a user-provided Read implementation. Note that this approach adds the runtime overhead of zero-initializing the buffer.
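A minimal sketch of the zero-initializing approach, assuming a free function instead of the trait method. The name read_string_zeroed and its signature are hypothetical and are not rucene's API; only the standard library is used.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of rucene): reads `length` bytes into a
/// zero-initialized buffer and decodes them as UTF-8.
fn read_string_zeroed<R: Read>(reader: &mut R, length: usize) -> io::Result<String> {
    // `vec![0u8; length]` allocates AND initializes the bytes, so the
    // `Read` implementation never observes uninitialized memory.
    let mut buffer = vec![0u8; length];
    reader.read_exact(&mut buffer)?;
    // Surface invalid UTF-8 as an I/O error instead of panicking.
    String::from_utf8(buffer).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The cost is one extra pass over the buffer to write zeros, which is the overhead the issue mentions.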

As of Feb 2021, there is not yet an ideal fix that works with no performance overhead. Below are links to relevant discussions & suggestions for the fix.

How does it compare to tantivy?

https://github.com/tantivy-search/tantivy is also a library that ports Lucene to Rust. Did you compare the features of this project with those of tantivy?

build error

I am just building the example code and getting the error below:

error[E0599]: no method named `get_ref` found for union `MaybeUninit` in the current scope
   --> /home/oz-mint/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.1/src/core/search/query/spans/span_near.rs:509:45
    |
509 |             } else if self.conjunction_span.get_ref().one_exhausted_in_current_doc {
    |                                             ^^^^^^^ method not found in `MaybeUninit<ConjunctionSpanBase<P>>`
    |
   ::: /home/oz-mint/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/pin.rs:804:18
    |
804 |     pub const fn get_ref(self) -> &'a T {
    |                  ------- the method is available for `Pin<&MaybeUninit<span::ConjunctionSpanBase<P>>>` here
    |
help: consider wrapping the receiver expression with the appropriate type
    |
509 |             } else if Pin::new(&self.conjunction_span).get_ref().one_exhausted_in_current_doc {
    |                       ++++++++++                     +

Some errors have detailed explanations: E0308, E0554, E0599, E0635.
For more information about an error, try `rustc --explain E0308`.
error: could not compile `rucene` due to 67 previous errors

Any suggestions?
The test project is here: https://github.com/ozkanpakdil/rust-examples/tree/main/rucene_test
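For context on the first error in the log: as far as I know, get_ref was an earlier unstable name for what the standard library now exposes as MaybeUninit::assume_init_ref. The sketch below only illustrates that stable method. The types ConjunctionState and Spans are hypothetical stand-ins, not rucene's real definitions, and changing this one call would not resolve the other errors in the log (E0554, for example, indicates the crate also uses nightly-only feature gates).

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for the struct held in the `MaybeUninit` field.
struct ConjunctionState {
    one_exhausted_in_current_doc: bool,
}

// Hypothetical stand-in for the span type that owns the field.
struct Spans {
    conjunction_span: MaybeUninit<ConjunctionState>,
}

impl Spans {
    fn new(exhausted: bool) -> Self {
        Spans {
            // The field is fully initialized at construction time.
            conjunction_span: MaybeUninit::new(ConjunctionState {
                one_exhausted_in_current_doc: exhausted,
            }),
        }
    }

    fn one_exhausted(&self) -> bool {
        // SAFETY: `new` always initializes `conjunction_span`.
        unsafe { self.conjunction_span.assume_init_ref() }.one_exhausted_in_current_doc
    }
}
```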

Indexing too many documents in one commit fails

Context: I am adding rucene to https://github.com/tantivy-search/search-benchmark-game.

It is a search benchmark comparing Lucene, Tantivy, Bleve, and now Rucene.
Indexing works, but I have to commit periodically to avoid a panic.

See the following two lines of code and comment.
https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/build_index.rs#L103-L104

(I suspect a u32 overflow)
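The periodic-commit workaround described above can be sketched as follows. DocSink and CountingSink are hypothetical stand-ins used only to show the batching pattern. They are not rucene's index writer API, and the batch size is an arbitrary parameter.

```rust
/// Hypothetical stand-in for an index writer; NOT rucene's actual API.
trait DocSink {
    fn add_document(&mut self, doc: &str);
    fn commit(&mut self);
}

/// Adds every document and commits after each full batch, plus once at the
/// end if a partial batch remains.
fn index_in_batches<S: DocSink>(sink: &mut S, docs: &[&str], batch_size: usize) {
    assert!(batch_size > 0, "batch_size must be positive");
    for (i, doc) in docs.iter().enumerate() {
        sink.add_document(doc);
        if (i + 1) % batch_size == 0 {
            sink.commit();
        }
    }
    if docs.len() % batch_size != 0 {
        sink.commit();
    }
}

/// Toy sink that only counts calls, used to exercise the helper.
#[derive(Default)]
struct CountingSink {
    docs: usize,
    commits: usize,
}

impl DocSink for CountingSink {
    fn add_document(&mut self, _doc: &str) {
        self.docs += 1;
    }
    fn commit(&mut self) {
        self.commits += 1;
    }
}
```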

Index too large

The search benchmark consists of indexing all documents in English Wikipedia.
To level the field, we merge all segments down to a single segment.

I was happy to see that rucene also implemented force_merge with the blocking option.

Unfortunately, after the merge finishes, I end up with an index of 24 GB.
(Tantivy and Lucene both end up with an index of 3 GB.)

Unable to build from source with the recommended toolchain

The build fails on the latest nightly, and also when using rustup run nightly-2019-10-28 cargo build:

error[E0658]: `cfg(doctest)` is experimental and subject to change
  --> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:74:7
   |
74 | #[cfg(doctest)]
   |       ^^^^^^^
   |
   = note: for more information, see https://github.com/rust-lang/rust/issues/62210
   = help: add `#![feature(cfg_doctest)]` to the crate attributes to enable

error[E0658]: `cfg(doctest)` is experimental and subject to change
  --> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:77:7
   |
77 | #[cfg(doctest)]
   |       ^^^^^^^
   |
   = note: for more information, see https://github.com/rust-lang/rust/issues/62210
   = help: add `#![feature(cfg_doctest)]` to the crate attributes to enable

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0658`.
error: could not compile `memoffset`.
warning: build failed, waiting for other jobs to finish...
error: build failed

Looks correct to me?

rustc --version
rustc 1.40.0-nightly (95f437b3c 2019-10-27)

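Which Lucene version does the master branch correspond to?

I built an index with the master branch, and after reading the segments file I found that the version is 6.4.18. Which Lucene version is this compatible with?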

>> read(segments_1)

header length: 35
lucene version: 6.4.18
version: 4
nameCounter: 1
segCount: 1
...

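Are the indexes generated by rucene fully compatible with native Lucene?
Is there any benchmark data comparing rucene with native Lucene?
Is there any test data for building indexes on distributed storage (I saw the slides you shared earlier)?
How does rucene perform on the IO-heavy segment merge operation, especially on distributed storage? Is there any data?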

PhraseQuery does not work

A phrase query fails with a panic when run against 10,000 Wikipedia documents.

See the following commented out code.
https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/do_query.rs#L126-L128

Is this project the whole engine of zhihu search or just a central library of it?

The difference is:
If it is the whole engine, we may need to deploy multiple instances, perhaps with Docker or something similar, to ensure HA and distribution.
If it is just a library for the engine (like Lucene relative to Elasticsearch/Solr), is there a plan to open source the whole engine as well?

Use after free / aliasing &mut's when using LongsPtr

The following code prints an arbitrary number, because the vec has already been dropped. I can also get aliasing &mut's to the same value by simply calling p.longs() multiple times, since it takes &self.

A fix for this would be to store a &mut Vec, add a lifetime parameter to LongsPtr, and have .longs() take &mut self.

fn main() {
    let p = rucene::core::util::LongsPtr::new(&mut vec![15], 0, 0);
    dbg!(p.longs()[0]);
}
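A rough sketch of the proposed fix, assuming a simplified constructor. LongsRef is a hypothetical type for illustration, and its fields and two-argument new are assumptions that do not mirror rucene's LongsPtr. Because the wrapper borrows the Vec for a lifetime 'a, the compiler rejects a temporary that is dropped while the wrapper is still used, and because longs() takes &mut self, two live mutable slices cannot coexist.

```rust
/// Hypothetical lifetime-checked wrapper; NOT rucene's actual type.
struct LongsRef<'a> {
    longs: &'a mut Vec<i64>,
    offset: usize,
}

impl<'a> LongsRef<'a> {
    /// Borrows the vector for `'a`, so the wrapper cannot outlive it.
    fn new(longs: &'a mut Vec<i64>, offset: usize) -> Self {
        LongsRef { longs, offset }
    }

    /// Takes `&mut self`, so the returned slice borrows the wrapper
    /// exclusively and a second call cannot alias it.
    /// Panics if `offset` is larger than the vector's length.
    fn longs(&mut self) -> &mut [i64] {
        &mut self.longs[self.offset..]
    }
}
```

With this shape, the snippet from the issue would not compile if the wrapper were used in a later statement, since the temporary vec![15] would be dropped while still borrowed.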

Fails to build with latest nightly

Rucene also fails to build on the stable channel (due to the use of #![feature]), so I tried the nightly release:

$ cargo build
   Compiling rucene v0.1.0
error[E0432]: unresolved import `std::boxed::FnBox`
  --> /Users/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.0/src/core/util/thread_pool.rs:27:5
   |
27 | use std::boxed::FnBox;
   |     ^^^^^^^^^^^^^^^^^ no `FnBox` in `boxed`

$ rustc --version
rustc 1.41.0-nightly (59947fcae 2019-12-08)
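For background on the error above: the unstable FnBox trait was removed from the standard library after Box<dyn FnOnce()> became directly callable (Rust 1.35). The sketch below shows how a job queue can be typed with Box<dyn FnOnce()> instead. The Job alias and run_jobs are hypothetical names and are not taken from rucene's thread_pool.rs.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical job type; a boxed `FnOnce` can be called directly.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Sends every job to a single worker thread and waits for it to finish.
fn run_jobs(jobs: Vec<Job>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let worker = thread::spawn(move || {
        // The loop ends once the sender is dropped and the queue is drained.
        for job in rx {
            job();
        }
    });
    for job in jobs {
        tx.send(job).expect("worker thread hung up");
    }
    drop(tx);
    worker.join().expect("worker thread panicked");
}
```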

Any plan to target more recent Lucene versions?

It would be great if we could work with the latest Lucene.

Which index codecs are supported?

Lucene has a lucene-backward-codecs library.

In trying to run a Lucene 8 shard, I hit:

Error: Error(CorruptIndex("index format either too new or too old: 4 <= 9 <= 6 doesn\'t hold"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })

Any plans to expand the supported codecs? Could you document which codecs are supported currently? Based on above, I assume between 4 and 6.
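The error message suggests a range check of roughly the following shape, where a format version must lie between a minimum and a maximum. This is a guess reconstructed from the message text. The function check_index_version and its parameters are hypothetical and are not rucene's code.

```rust
/// Hypothetical reconstruction of a format-version range check.
/// Returns an error message when `actual` falls outside `min..=max`.
fn check_index_version(actual: i32, min: i32, max: i32) -> Result<(), String> {
    if actual < min || actual > max {
        return Err(format!(
            "index format either too new or too old: {} <= {} <= {} doesn't hold",
            min, actual, max
        ));
    }
    Ok(())
}
```

Under this reading, the reported failure corresponds to calling the check with a version of 9 against a supported range of 4 to 6.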

Does Rucene support Chinese character indexing and searching?

This might be a silly question: does Rucene support Chinese character indexing and searching?

I don't see any tokenizer under https://github.com/zhihu/rucene/tree/master/src/core/analysis
