blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go

License: Apache License 2.0

Go 99.78% Shell 0.01% Yacc 0.22%

bleve's Issues

cjk_width filter

fold fullwidth ASCII variants into the equivalent basic Latin
fold halfwidth Katakana variants into the equivalent Kana
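
A minimal sketch of the two folding rules above, written as a standalone function rather than against bleve's token filter interface. The name foldWidth and the halfwidthKana table are hypothetical; the table holds only five entries for illustration, and a real filter would need the full halfwidth Katakana range, including composition of voiced sound marks.

```go
package main

import "fmt"

// halfwidthKana is a deliberately tiny, illustrative mapping from
// halfwidth Katakana to the regular Katakana. It is NOT complete.
var halfwidthKana = map[rune]rune{
	'ｱ': 'ア', 'ｲ': 'イ', 'ｳ': 'ウ', 'ｴ': 'エ', 'ｵ': 'オ',
}

// foldWidth folds fullwidth ASCII variants (U+FF01..U+FF5E) into basic
// Latin (U+0021..U+007E) and the halfwidth Katakana listed above into
// their regular forms. All other runes pass through unchanged.
func foldWidth(token string) string {
	out := make([]rune, 0, len(token))
	for _, r := range token {
		if r >= 0xFF01 && r <= 0xFF5E {
			// The fullwidth block mirrors ASCII at a fixed offset.
			r -= 0xFEE0
		} else if k, ok := halfwidthKana[r]; ok {
			r = k
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	fmt.Println(foldWidth("ＢＬＥＶＥ１２３"))
	fmt.Println(foldWidth("ｱｲｳ"))
}
```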

document mapping supports building _all field

token synonym filter

ability to load synonyms from files (like stop word lists)
ability to either expand (index all synonyms)
or contract (consolidate synonyms to single version)
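
The expand and contract modes could look roughly like the sketch below. It works on plain string tokens and takes synonym groups in memory; loading them from files is left out. The synonymFilter type, its fields, and the rule that the first term of a group is the canonical form are all assumptions made for illustration, not bleve's API.

```go
package main

import "fmt"

// synonymFilter is a hypothetical filter. Each term maps to its whole
// synonym group; the first entry of a group is treated as canonical.
type synonymFilter struct {
	groups map[string][]string
	expand bool
}

func newSynonymFilter(groups [][]string, expand bool) *synonymFilter {
	m := make(map[string][]string)
	for _, g := range groups {
		for _, term := range g {
			m[term] = g
		}
	}
	return &synonymFilter{groups: m, expand: expand}
}

// Filter either emits every synonym of a token (expand) or replaces the
// token with the canonical term of its group (contract). A real filter
// would also keep all expanded synonyms at the same token position.
func (f *synonymFilter) Filter(tokens []string) []string {
	var out []string
	for _, tok := range tokens {
		g, ok := f.groups[tok]
		switch {
		case !ok:
			out = append(out, tok)
		case f.expand:
			out = append(out, g...)
		default:
			out = append(out, g[0])
		}
	}
	return out
}

func main() {
	groups := [][]string{{"quick", "fast", "rapid"}}
	tokens := []string{"fast", "car"}
	fmt.Println(newSynonymFilter(groups, true).Filter(tokens))
	fmt.Println(newSynonymFilter(groups, false).Filter(tokens))
}
```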

also, investigate wordnet: http://wordnet.princeton.edu/

acquire stop token dictionaries for all languages with stemmer support

Depends on #19

use protobufs to encode index values

While we can't use them for the index keys, which we craft to get the desired sort order, we should use protobufs to encode the index values. This will make the binary serialization/deserialization less error-prone, more compact, and easier to evolve over time.

support prefix search

Two modes:

Return terms which start with this prefix
Return documents which contain a term starting with this prefix
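
Both modes can be sketched over a sorted term list, which stands in here for an ordered key-value store. The function names and the in-memory postings map are hypothetical. The sketch relies on the fact that, in byte-sorted order, all terms sharing a prefix are contiguous and begin at the first term greater than or equal to the prefix.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// termsWithPrefix returns every term starting with prefix.
// sortedTerms must be in ascending byte order.
func termsWithPrefix(sortedTerms []string, prefix string) []string {
	start := sort.SearchStrings(sortedTerms, prefix)
	var out []string
	for _, t := range sortedTerms[start:] {
		if !strings.HasPrefix(t, prefix) {
			break // past the contiguous run of matching terms
		}
		out = append(out, t)
	}
	return out
}

// docsWithPrefix returns the sorted, de-duplicated IDs of documents that
// contain at least one term starting with prefix.
func docsWithPrefix(sortedTerms []string, postings map[string][]string, prefix string) []string {
	seen := make(map[string]bool)
	var docs []string
	for _, t := range termsWithPrefix(sortedTerms, prefix) {
		for _, d := range postings[t] {
			if !seen[d] {
				seen[d] = true
				docs = append(docs, d)
			}
		}
	}
	sort.Strings(docs)
	return docs
}

func main() {
	postings := map[string][]string{
		"blend": {"doc2", "doc1"},
		"bleve": {"doc1"},
		"index": {"doc3"},
	}
	terms := []string{"blend", "bleve", "index"}
	fmt.Println(termsWithPrefix(terms, "ble"))
	fmt.Println(docsWithPrefix(terms, postings, "ble"))
}
```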

index term entry should be able to include hierarchical position data

Currently index term entries are:

't'

Would like to add support for also storing the position of this term in any arrays that were a part of the field path.

Not 100% decided that this must be in the key, but that would be the only way to have some hope of efficiently querying on this information.

The idea is to be able to further qualify queries and say that in addition to other query criteria, matching items must occur in the same parent element.

Consider the following documents in an index.

{
  "name": "a",
  "children": [
    {
      "name": "c",
      "age": 25
    },
    {
      "name": "d",
      "age": 15
    }
  ]
}

{
  "name": "b",
  "children": [
    {
      "name": "c",
      "age": 15
    },
    {
      "name": "d",
      "age": 25
    }
  ]
}

Logically we want to query:
child.name = "c" AND child.age < 20 AND same child

Both documents have a child named "c" and a child whose age is less than 20, but ONLY "b" satisfies both criteria in the same child.

The implementation idea is to include the position in the children array, and the query criteria "same child" is accomplished by verifying that matching items have the same value.
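
As a rough sketch of that idea, assume each criterion yields a list of (document, array position) matches. The "same child" constraint is then an intersection on both values. The match type and sameElement function are hypothetical names; the data mirrors the two example documents above.

```go
package main

import "fmt"

// match records which document satisfied a criterion and at which
// position in the children array the matching value was found.
type match struct {
	DocID    string
	ArrayPos int
}

// sameElement returns the documents in which both criteria matched at
// the same array position, in the order they first appear in b.
func sameElement(a, b []match) []string {
	inA := make(map[match]bool, len(a))
	for _, m := range a {
		inA[m] = true
	}
	added := make(map[string]bool)
	var docs []string
	for _, m := range b {
		if inA[m] && !added[m.DocID] {
			added[m.DocID] = true
			docs = append(docs, m.DocID)
		}
	}
	return docs
}

func main() {
	// name = "c": first child in both documents.
	nameIsC := []match{{"a", 0}, {"b", 0}}
	// age < 20: second child of "a", first child of "b".
	ageUnder20 := []match{{"a", 1}, {"b", 0}}
	fmt.Println(sameElement(nameIsC, ageUnder20))
}
```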

Open() function should have options for create new (default false) and read-only (default false)

support for facet queries

Initial implementation should just operate at query time.

If we swap the field and term order in the index key we can support faceting at query time. For every document satisfying the original query, we can look up the document in the back index, and find entries for the field that is being faceted. Seems like we don't even have to load that key, just be able to parse the field id and terms. For categorical facets the terms are bucketed and counted. For numerical range facets the parsed terms are bucketed and counted. The top-N facets are then returned with the query results.
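
The categorical case can be sketched as follows. The backIndex type is a hypothetical in-memory stand-in (document ID to field to terms) for the real back index rows, and it assumes each term is listed at most once per document and field. Numerical range facets would bucket parsed values instead of counting raw terms.

```go
package main

import (
	"fmt"
	"sort"
)

// facetCount pairs a term with the number of hits that contain it.
type facetCount struct {
	Term  string
	Count int
}

// backIndex is an illustrative stand-in: doc ID -> field -> terms.
type backIndex map[string]map[string][]string

// topTermFacets counts the terms of one field across all hits and
// returns the n most frequent, ties broken alphabetically.
func topTermFacets(hits []string, bi backIndex, field string, n int) []facetCount {
	counts := make(map[string]int)
	for _, docID := range hits {
		for _, term := range bi[docID][field] {
			counts[term]++
		}
	}
	facets := make([]facetCount, 0, len(counts))
	for term, c := range counts {
		facets = append(facets, facetCount{Term: term, Count: c})
	}
	sort.Slice(facets, func(i, j int) bool {
		if facets[i].Count != facets[j].Count {
			return facets[i].Count > facets[j].Count
		}
		return facets[i].Term < facets[j].Term
	})
	if n >= 0 && len(facets) > n {
		facets = facets[:n]
	}
	return facets
}

func main() {
	bi := backIndex{
		"d1": {"type": {"beer"}},
		"d2": {"type": {"beer"}},
		"d3": {"type": {"brewery"}},
	}
	fmt.Println(topTermFacets([]string{"d1", "d2", "d3"}, bi, "type", 2))
}
```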

cjk_bigram

create google group for discussion

update readme to reference google group

galician stemmer

update README to illustrate how easy it is to get started

Once the new API is done, the project README should illustrate how easy it is to index data and query it.

catalan stemmer

change back index entries to just contain list of keys

Currently the back index contains 2 separate lists of more strongly typed data. This should be changed to just a flat list of keys. This will make it easier to introduce new index row types in the future without having to keep updating the way the back index works.

truncate token length filter

truncate token at the specified max length
useful for fields left as a single token
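
A minimal sketch of the truncation step, assuming the maximum length is counted in runes rather than bytes so that multi-byte characters are never split. The function name truncateToken is hypothetical and the sketch is independent of bleve's filter interface.

```go
package main

import "fmt"

// truncateToken returns at most maxLen runes of token. Cutting at a rune
// boundary keeps the result valid UTF-8.
func truncateToken(token string, maxLen int) string {
	if maxLen < 0 {
		maxLen = 0
	}
	count := 0
	for i := range token {
		if count == maxLen {
			return token[:i] // i is the byte offset of the next rune
		}
		count++
	}
	return token // token already fits within maxLen
}

func main() {
	fmt.Println(truncateToken("bleve", 3))
	fmt.Println(truncateToken("日本語テキスト", 2))
}
```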

implement common terms query

See http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/

create top-level api for indexing

The top-level bleve package should be all one needs to import to achieve the following:

create new index
open existing index
create new default mapping
modify default mapping into custom mapping
index document/object

building bleve with all the c libraries
creating a custom mapping
one for each type of field
one for each type of analyzer/character filter/tokenizer/token filter
- these can be stubs initially, but serve as placeholders for adding information over time
one for each type of query