rucene's People

Contributors

3pointer, jtong11, sunxiaoguang, unix1986

rucene's Issues

`Read` on uninitialized buffer may cause UB

Hello fellow Rustacean,
we (Rust group @sslab-gatech) found a memory-safety/soundness issue in this crate while scanning Rust code on crates.io for potential vulnerabilities.

Issue Description

let mut buffer = Vec::with_capacity(length);
unsafe {
    buffer.set_len(length);
};
self.read_exact(&mut buffer)?;

The core::store::io::data_input::DataInput::read_string() method creates an uninitialized buffer and passes it to a user-provided Read implementation. This is unsound because it allows safe Rust code to exhibit undefined behavior (reading from uninitialized memory).

This part from the Read trait documentation explains the issue:

It is your responsibility to make sure that buf is initialized before calling read. Calling read with an uninitialized buf (of the kind one obtains via MaybeUninit<T>) is not safe, and can lead to undefined behavior.

How to fix the issue?

The naive but safe way to fix the issue is to always zero-initialize the buffer before lending it to a user-provided Read implementation. Note that this approach adds the runtime overhead of zero-initializing the buffer.
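A minimal sketch of the zero-initializing approach, assuming a free function over any `Read` rather than the crate's actual `DataInput` trait method. The name `read_string_zeroed` and the UTF-8 error mapping are illustrative assumptions, not rucene's code:

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of rucene): reads exactly `length` bytes
/// into a zero-initialized buffer and decodes them as UTF-8.
fn read_string_zeroed<R: Read>(reader: &mut R, length: usize) -> io::Result<String> {
    // `vec![0u8; length]` yields fully initialized memory, so handing the
    // slice to an arbitrary `Read` implementation is sound.
    let mut buffer = vec![0u8; length];
    reader.read_exact(&mut buffer)?;
    // Assumption: invalid UTF-8 is reported as an `InvalidData` I/O error.
    String::from_utf8(buffer).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The only cost compared with the original code is the zero-fill of `length` bytes before the read.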

As of Feb 2021, there is not yet an ideal fix that works with no performance overhead. Below are links to relevant discussions & suggestions for the fix.

build error

I am just building the example code and getting the error below:

error[E0599]: no method named `get_ref` found for union `MaybeUninit` in the current scope
   --> /home/oz-mint/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.1/src/core/search/query/spans/span_near.rs:509:45
    |
509 |             } else if self.conjunction_span.get_ref().one_exhausted_in_current_doc {
    |                                             ^^^^^^^ method not found in `MaybeUninit<ConjunctionSpanBase<P>>`
    |
   ::: /home/oz-mint/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/pin.rs:804:18
    |
804 |     pub const fn get_ref(self) -> &'a T {
    |                  ------- the method is available for `Pin<&MaybeUninit<span::ConjunctionSpanBase<P>>>` here
    |
help: consider wrapping the receiver expression with the appropriate type
    |
509 |             } else if Pin::new(&self.conjunction_span).get_ref().one_exhausted_in_current_doc {
    |                       ++++++++++                     +

Some errors have detailed explanations: E0308, E0554, E0599, E0635.
For more information about an error, try `rustc --explain E0308`.
error: could not compile `rucene` due to 67 previous errors

Any suggestions?
The project is here: https://github.com/ozkanpakdil/rust-examples/tree/main/rucene_test
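Some context on the `get_ref` error, as a hedged sketch rather than a patch for rucene: as far as I know, `MaybeUninit::get_ref` was an unstable nightly-only method that was later renamed to `assume_init_ref`, which is why a crate written against an old nightly no longer compiles on current toolchains. The `SpanState` struct and `is_exhausted` function below are hypothetical stand-ins for the crate's types:

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for the crate's `ConjunctionSpanBase`.
struct SpanState {
    one_exhausted_in_current_doc: bool,
}

/// Reads a field through a `MaybeUninit` slot using the current API name.
///
/// # Safety
/// The caller must guarantee that `slot` has been fully initialized.
unsafe fn is_exhausted(slot: &MaybeUninit<SpanState>) -> bool {
    // `assume_init_ref` is the stable method name on current toolchains.
    slot.assume_init_ref().one_exhausted_in_current_doc
}
```

The other errors in the log (E0554, E0635) suggest the crate also relies on nightly-only features, so renaming this one call would not be enough to build on stable.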

Index too large

The search benchmark consists of indexing all documents in English Wikipedia.
To level the playing field, we merge all segments down to a single segment.

I was happy to see that rucene also implements force_merge with the blocking option.

Unfortunately, after the merge finishes, I end up with an index of 24 GB.
(Tantivy and Lucene both end up with an index of 3 GB.)

unable to build from source on toolchain recommendation

The build fails on the latest nightly, and also when using rustup run nightly-2019-10-28 cargo build:

error[E0658]: `cfg(doctest)` is experimental and subject to change
  --> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:74:7
   |
74 | #[cfg(doctest)]
   |       ^^^^^^^
   |
   = note: for more information, see https://github.com/rust-lang/rust/issues/62210
   = help: add `#![feature(cfg_doctest)]` to the crate attributes to enable

error[E0658]: `cfg(doctest)` is experimental and subject to change
  --> /Users/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/memoffset-0.5.6/src/lib.rs:77:7
   |
77 | #[cfg(doctest)]
   |       ^^^^^^^
   |
   = note: for more information, see https://github.com/rust-lang/rust/issues/62210
   = help: add `#![feature(cfg_doctest)]` to the crate attributes to enable

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0658`.
error: could not compile `memoffset`.
warning: build failed, waiting for other jobs to finish...
error: build failed

The toolchain version looks correct to me:

rustc --version
rustc 1.40.0-nightly (95f437b3c 2019-10-27)

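Which Lucene version does the master branch correspond to?

I built an index with the master branch, and after reading the segments file I found that the version is 6.4.18. Which Lucene version is this compatible with?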

>> read(segments_1)

header length: 35
lucene version: 6.4.18
version: 4
nameCounter: 1
segCount: 1
...
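  1. Is the index generated by rucene fully compatible with native Lucene?
  2. Is there any benchmark data comparing rucene with native Lucene?
  3. Is there any test data for building indexes on distributed storage (I saw the slides you shared earlier)?
  4. How does rucene perform on the IO-heavy segment merge operation? Is there any data, especially on distributed storage?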

Use after free / aliasing &mut's when using LongsPtr

The following code prints an arbitrary number because the Vec has already been dropped. I can also get aliasing &mut references to the same value simply by calling p.longs() multiple times, since it takes &self.

A fix would be to store a &mut Vec and add a lifetime parameter to LongsPtr, and have .longs() take &mut self.

fn main() {
    let p = rucene::core::util::LongsPtr::new(&mut vec![15], 0, 0);
    dbg!(p.longs()[0]);
}
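The proposed fix could look roughly like the sketch below. `LongsRef` is a hypothetical replacement type, the `i64` element type is assumed, and the `offset` and `length` fields are only guesses at what the two integer arguments of `LongsPtr::new` mean:

```rust
/// Hypothetical borrow-checked replacement for `LongsPtr`.
pub struct LongsRef<'a> {
    longs: &'a mut Vec<i64>,
    offset: usize, // assumed meaning of the second constructor argument
    length: usize, // assumed meaning of the third constructor argument
}

impl<'a> LongsRef<'a> {
    pub fn new(longs: &'a mut Vec<i64>, offset: usize, length: usize) -> Self {
        LongsRef { longs, offset, length }
    }

    /// Takes `&mut self`, so two live mutable borrows of the data cannot coexist.
    pub fn longs(&mut self) -> &mut Vec<i64> {
        &mut *self.longs
    }

    pub fn offset(&self) -> usize {
        self.offset
    }

    pub fn length(&self) -> usize {
        self.length
    }
}
```

With this shape, the reproduction above is rejected at compile time: the temporary `vec![15]` does not live long enough to be borrowed for `'a`.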

Fails to build with latest nightly

Rucene also fails to build on the stable channel (due to its use of #![feature] attributes), so I tried the nightly release:

$ cargo build
   Compiling rucene v0.1.0
error[E0432]: unresolved import `std::boxed::FnBox`
  --> /Users/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/rucene-0.1.0/src/core/util/thread_pool.rs:27:5
   |
27 | use std::boxed::FnBox;
   |     ^^^^^^^^^^^^^^^^^ no `FnBox` in `boxed`

$ rustc --version
rustc 1.41.0-nightly (59947fcae 2019-12-08)
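For reference, `FnBox` was an unstable helper that was removed from nightly once `Box<dyn FnOnce()>` became directly callable (stable since Rust 1.35, as far as I know). A thread-pool job type can be written without it, as in this sketch. The `Job` alias and `run_jobs` function are illustrative and not taken from rucene's thread_pool.rs:

```rust
use std::sync::mpsc;
use std::thread;

// A boxed one-shot closure; no `FnBox` import is needed to call it.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Runs every job on a single worker thread and waits for completion.
fn run_jobs(jobs: Vec<Job>) {
    let (sender, receiver) = mpsc::channel::<Job>();
    let worker = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for job in receiver {
            job(); // calling a `Box<dyn FnOnce()>` by value consumes it
        }
    });
    for job in jobs {
        sender.send(job).expect("worker thread should be alive");
    }
    drop(sender); // close the channel so the worker can exit
    worker.join().expect("worker thread panicked");
}
```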

Which index codecs are supported?

Lucene has a lucene-backward-codecs library.

In trying to run a Lucene 8 shard, I hit:

Error: Error(CorruptIndex("index format either too new or too old: 4 <= 9 <= 6 doesn\'t hold"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })

Are there any plans to expand the supported codecs? Could you document which codecs are currently supported? Based on the error above, I assume format versions 4 through 6.
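The error message reads like a simple range check on the format version stored in the file header. A sketch of such a check, assuming a hypothetical `check_format_version` function and a plain `String` error instead of the crate's own error type:

```rust
/// Hypothetical range check that reproduces the wording of the error above.
fn check_format_version(version: i32, min: i32, max: i32) -> Result<i32, String> {
    if version < min || version > max {
        Err(format!(
            "index format either too new or too old: {} <= {} <= {} doesn't hold",
            min, version, max
        ))
    } else {
        Ok(version)
    }
}
```

Under this reading, the file written by Lucene 8 carries format version 9, while the reader accepts only 4 through 6.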
