fergiemcdowall / search-index
A persistent, network resilient, full text search library for the browser and Node.js
License: MIT License
I know the example in the documentation uses a different notation (['a, b']), but this is the only way multiple-word search works for me:
var request = {
  "query": {
    "slug": ['a', 'b']
  },
  ...
};
Everything works fine if there are results, but when no results are found a "Cannot get length property of null" error is thrown by:
if (RIKeySet.length == 1) seekCutOff = (q.pageSize + q.offset);
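A null check before reading .length would avoid the crash. A minimal sketch, assuming RIKeySet can be null when nothing matches (computeSeekCutOff is a hypothetical wrapper, not the library's actual function):

```javascript
// Hypothetical sketch: guard against a null/undefined key set before
// reading .length, so an empty result set no longer throws.
function computeSeekCutOff(RIKeySet, q) {
  if (!RIKeySet || RIKeySet.length === 0) {
    return 0; // no hits: nothing to seek past
  }
  var seekCutOff = -1;
  if (RIKeySet.length === 1) seekCutOff = (q.pageSize + q.offset);
  return seekCutOff;
}
```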
It should be possible to make the index much smaller. Try to do this.
If you define an indexPath in the options, the empty function removes the 'si' folder and recreates a new db named 'si'. The database defined in the options is never emptied.
Faceted search and longer search strings are very fast.
However, single-word queries with a large recall can be slower. On the test dataset (Reuters), on low-end systems, a search for 'usa' takes around 250-300ms to return 12,500 docs.
This could be sped up by iterating through all docvector tokens in the index and caching search results for them.
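The caching idea above can be sketched as a simple memoising wrapper around a per-token lookup. This is illustrative only, not search-index's internals; lookupFn stands in for whatever expensive index scan the library performs:

```javascript
// Illustrative sketch (not the library's API): memoise per-token result
// sets so hot single-word queries skip the index scan on repeat lookups.
function makeTokenCache(lookupFn) {
  var cache = {};
  return function (token) {
    if (!(token in cache)) cache[token] = lookupFn(token);
    return cache[token];
  };
}
```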
Hi,
I have a small snippet that adds all files and subfolder files to the search-index using the following lines:
var si = require('search-index')({ indexPath: 'index.gz' });
...
for (var i = 0; i < files.length; i++) {
  var f = files[i]; // f = file path
  debug_si('Add ' + f + '.');
  var batchName = 'sona';
  var filters = ['path'];
  var data = {};
  data[f] = { 'path': f };
  si.add({ 'batchName': batchName, 'filters': filters }, data, function (err) {
    if (err) {
      debug_si('Error adding ' + f + '.');
      callback(err);
    }
  });
}
After the code runs through, a lot of information debug logs are printed:
....
....
[information] "reinserting tf sets"
[information] "sorting tf sets"
(the two lines above repeat many times)
[information] "[success] incremental calibration complete"
[success] "indexed batch: [object Object]"
But finally I got an exception:
D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\search-index.js:55
callback(msg);
^
TypeError: undefined is not a function
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\search-index.js:55:5
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\indexing\indexer.js:233:11
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\indexing\calibrater.js:43:9
at D:\workspace_js\node-track-file-changes\node_modules\search-index\node_modules\level\node_modules\level-packager\node_modules\levelup\lib\levelup.js:351:9
I don't know why I get this error. Another question: what do 'batchName' and 'filters' actually do?
Maybe someone can help me.
When simply doing var si = require('search-index');
the following error is thrown:
module.js:340
throw err;
^
Error: Cannot find module 'fstream'
at Function.Module._resolveFilename (module.js:338:15)
at Function.Module._load (module.js:280:25)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object.<anonymous> (D:\programming\node_modules\search-index\lib\indexing\replicator.js:3:11)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
fstream is referenced in ./lib/indexing/replicator.js but not listed in package.json.
Discussion?
As search-index is upgraded it may not be compatible with existing index schemas. Perhaps it's time to store a 'semver' flag in the index itself, check it against the version of search-index in use, and [foobar] to the user if required.
If agreed, I'll try to submit an update when I get a moment.
Hello,
Thank you for the search-index module. I have a question: in many cases I don't want to see the search-index logs. How can I tell search-index not to log information?
Frank
It should be possible to make totally distinct search engine objects.
At the moment, instantiating two or more search-indexes in the same program is problematic.
Do instantiation like this:
var SearchIndex = require("search-index");
var si1 = new SearchIndex(options1);
var si2 = new SearchIndex(options2);
Checking for errors is a bit of a hassle, and the callback signature isn't exactly obvious at the moment. Using things like the async module is a pain, as you have to write wrappers for every call.
I'm happy to drop in a pull request for this.
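What is being asked for is the standard Node error-first callback convention. A sketch of the shape (the search function and result fields here are assumptions for illustration, not search-index's actual API):

```javascript
// Sketch of the Node-style error-first signature being requested:
// callback(err, result), with err === null on success. With this shape,
// modules like async can consume the calls without wrapper functions.
function search(query, callback) {
  if (typeof query !== 'object' || query === null) {
    return callback(new Error('query must be an object'));
  }
  // hypothetical result shape
  callback(null, { totalHits: 0, hits: [] });
}
```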
There is no longer a pretty indexing log. Winston is not playing nice for some reason.
It should be possible to filter on a range of values, for instance time intervals or lat/lon.
At the moment, malformed JSON crashes the system. Make it so that malformed JSON returns a pretty error while keeping the system up.
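A minimal sketch of the guard being requested, assuming the input arrives as a JSON string (safeParse is a hypothetical helper, not part of search-index):

```javascript
// Sketch: validate incoming JSON text and hand back an error instead of
// letting JSON.parse throw and take the process down.
function safeParse(text, callback) {
  var doc;
  try {
    doc = JSON.parse(text);
  } catch (e) {
    return callback(new Error('malformed JSON: ' + e.message));
  }
  return callback(null, doc);
}
```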
Make the mapreduce more powerful by using multicore functionality where available
https://nodejs.org/docs/v0.12.0/api/cluster.html
http://blog.carbonfive.com/2014/02/28/taking-advantage-of-multi-processor-environments-in-node-js/
Users can query on one field only. For example, only return docs with the term 'banana' in the title field
Build up a new test stack with karma and jasmine
Concentrate more on logic in search index, and compileability/correctness in Forage. Take web service testing out of Forage.
Some support for English language stemming.
"buying" should give hits for "buy", "buys", "buyer", etc
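Real English stemming would normally use something like the Porter algorithm (available in Node via NLP libraries), but the intent can be illustrated with a naive suffix-stripping sketch; naiveStem is purely a toy for this example:

```javascript
// Naive illustration only: a real implementation would use the Porter
// stemmer. This strips a few common suffixes so that "buying", "buys"
// and "buyer" all reduce to the same stem as "buy".
function naiveStem(word) {
  return word
    .toLowerCase()
    .replace(/(ing|ers|er|s)$/, '');
}
```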
I have about 52k documents that have been created by reading from my sqlite database.
When I loop through the 52k docs and do si.add, it takes forever. The process slows down at around 200 and then indexes really slowly. Is this expected, or am I missing something?
*Update: I had about 52k documents, not 900.
I haven't checked the code thoroughly, but since search-index uses levelup, a streaming API shouldn't be impossible?
Indexing from an fs or http stream makes a lot of sense to me.
I get this error:
/home/fatih/test/node_modules/search-index/lib/mapreduce/searcher.js:42
totalHits = intersection.length;
^
TypeError: Cannot read property 'length' of undefined
at /home/fatih/test/node_modules/search-index/lib/mapreduce/searcher.js:42:31
at /home/fatih/test/node_modules/search-index/node_modules/level-multiply/level-multiply.js:17:15
at proxy (/home/fatih/test/node_modules/search-index/node_modules/level-multiply/node_modules/after/lib/after.js:22:39)
at /home/fatih/test/node_modules/search-index/node_modules/level-multiply/level-multiply.js:29:13
at dispatchError (/home/fatih/test/node_modules/search-index/node_modules/level/node_modules/level-packager/node_modules/levelup/lib/util.js:131:7)
at /home/fatih/test/node_modules/search-index/node_modules/level/node_modules/level-packager/node_modules/levelup/lib/levelup.js:197:14
My code is this:
var si = require('search-index');
var colors = require('colors');
var fs = require('fs');
var data = JSON.parse(fs.readFileSync('node_modules/search-index/test/testdata/reuters-021.json'));
si.add(data, 'reuters-021.json', [], function (indexingMsg) {
  console.log(indexingMsg);
});
console.log("Search data *".underline.red);
si.search({
  'query': {
    '*': '*'
  }
}, function (searchResults) {
  console.log(searchResults.green);
});
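A likely cause of the undefined-intersection crash above is ordering: si.search() fires before si.add() has finished indexing. The principle can be sketched with stubbed add/search functions (hypothetical stand-ins whose callbacks fire synchronously, not search-index's real API): the query belongs inside the add callback.

```javascript
// Stubbed stand-ins for illustration: record the call order to show
// that search only runs once add's callback has fired.
var order = [];
function add(data, cb) { order.push('add'); cb(null); }
function search(q, cb) { order.push('search'); cb(null, { hits: [] }); }

add({ doc1: { title: 'reuters' } }, function (err) {
  if (err) throw err;
  search({ query: { '*': '*' } }, function (err, results) {
    if (err) throw err;
    // results are only used once indexing has completed
  });
});
```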
To be able to set a time range (from time/date to time/date) as a filter/navigator would be great! This is the upstream version of this issue, over at Norch:
fergiemcdowall/norch#65
Following the principle of Occam's razor, strip out the bloom filter functionality; it's not really needed for the direction search-index is going in.
Should be able to index and retrieve documents from one or more remote indexes. Support sharding
The compiler complains that the file stdint.h cannot be found. It is available only in recent MSVC versions. One workaround was suggested on Stack Overflow: downloading the MSVC-compatible version of the file from msinttypes.
Would you accept a patch fixing the build on older versions of MSVC, please? I'd detect such versions and include the file from msinttypes, stored under a different name than stdint.h.
Does search-index use leveldb to store the index? If so, where are the leveldb files stored, and can I specify a location when initializing search-index?
Another question: if I use Chinese characters, should I segment the sentence with spaces before inserting documents into the index?
Just like this:
si.add({'doc1': {'title': "中文 字体"}}, batchName, filters, function (msg) {
  res.send(msg);
});
And use the search API like this:
si.search({"query": {"*": ["中文 字体"]}}, function (msg) {
  res.send(msg);
});
Is this right?
It would be helpful if there were a configuration option that allows a consumer to set the location of the index.
In reality, indexes are big and should be housed somewhere other than with the application server code.
Is this something I can add as a pull request?
At the moment you can facet and filter on single values, but not value 'buckets' or ranges. This functionality was present in earlier builds, but has fallen out of the most recent build because of lazy documentation (my bad) and gaps in the test coverage (also my bad).
For newcomers wanting to try search-index as an indexing solution, the documentation is a bit slim and short of examples. For example, what are "facets"? The snippets in the documentation are anything but clear on their purpose and how to use them.
Could there be, at least, a full working example of the engine? And perhaps more than
Q: What is a facet?
R: Allows faceted navigation.
for the different options?
Also, what is one expected to get when using teaser? "Creates a field that shows where the search terms exist in the given field." Can an example result be given?
If no ID is specified, autogenerate one, preferably from a timestamp.
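A minimal sketch of timestamp-based ID generation. The counter suffix is an assumption of mine (to keep IDs unique within one millisecond), not a scheme search-index defines:

```javascript
// Sketch: derive a document ID from a timestamp when none is supplied.
// The counter suffix (hypothetical) keeps IDs unique when several docs
// arrive within the same millisecond.
var idCounter = 0;
function autoId(existingId) {
  if (existingId !== undefined && existingId !== null) return existingId;
  return Date.now() + '-' + (idCounter++);
}
```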
Hello There,
I might not have gotten to grips with the internals of the module, but my initial attempt was to create one node script which indexed some data, and another which provided a query HTTP API via express.
It seems that when one process is started there is an IO lock which does not allow the other to read information. I understand that this is locking to ensure the validity of the index.
OpenError: IO error: lock si/LOCK: Resource temporarily unavailable
Is there any way that the above scenario could be tackled?
Thanks,
Fotis
Line 2 in ./lib/search-index.js
colors = require('colors'),
A demo of a search with search-index that runs nowhere else but the browser.
Option to recalibrate index as documents are added
When I use search-index, it asks for fstream, but fstream is not included in the dependency list.
matcher (and the whole project, really) seems very handy, but it feels like there's a gap between it and getting search results.
In the use case of a search box with typeahead support, matcher will get you the suggestions to present to the user, but once the user selects one of the suggestions, to retrieve the actual relevant documents you then have to call search with the selection.
It would be great if matcher gave you back not only matching field strings but also the docID, so you could just call get instead of search. Would the right approach for this be for indexDoc in indexer.js to stuff the docID in with the reverseIndex key?
exports.index = function(batchString, batchName, filters, callback) {
e.g. si.add is mentioned but no longer exists.
I don't know if I have forgotten something, but I indexed a few documents and when I tried to make a query, the query seems to be case sensitive. If I look for "this" and a field contains "This is an example", search-index does not return it.
Is this the expected behaviour? Should I index all documents in lowercase?
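The usual fix is to normalise case at both index time and query time. A sketch of the idea; the tokenizer here is a stand-in for illustration, not search-index's own:

```javascript
// Sketch: lowercase both the indexed text and the query term so "this"
// matches "This". The tokenizer is a hypothetical stand-in.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}
function matches(fieldText, term) {
  return tokenize(fieldText).indexOf(term.toLowerCase()) !== -1;
}
```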
Facets should be sortable (alphabetic, numeric, magnitude). There should also be a query vocabulary to express this.
This is a way of taking field length into account when talking about term frequency. Shorter fields are typically more meaningful, therefore terms appearing in shorter fields are given a higher value. Should be calculated on indexing.
Users can return all docs in the index by using, say, an asterisk ('*') as the query term.
Line 153:
if (docFreqs[leastFrequentKey] == 0) {
  sendResultSet();
}
The code above doesn't return, so it goes on to scan the index and triggers the callback a second time. Pull request coming up.
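The fix being described can be sketched as follows; runQuery is a simplified stand-in for the surrounding searcher logic, not the actual function:

```javascript
// Sketch of the fix: returning after sendResultSet() prevents the index
// scan from firing the callback a second time.
function runQuery(docFreqs, leastFrequentKey, sendResultSet, scanIndex) {
  if (docFreqs[leastFrequentKey] === 0) {
    return sendResultSet(); // early exit: empty result set, done
  }
  scanIndex();
}
```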
Update the term dictionary as individual documents are added and taken away from the index, but make it optional
Would you consider having search-index use the levelup module directly instead of via level? Along with some way to pass configuration options to the constructor, this would allow use of level-js, so search-index would be usable in the browser as well as in Node.js. I'd be willing to work on a patch for this.
I believe the correct key for page size is pageSize, but readme.MD shows pagesize.
If pagesize is used, totalHits shows the correct value but the hits array is empty.
Cache queries, so that results can be returned without performing an actual search.
Must be kept off of a multinode installation since old caches cannot be removed if the keys are hashed, and it is faster if it is local.
Why is the err param from the si.del callback true when all is OK?
In deleteDoc, wouldn't it be better to return callback(null, true) instead of return callback(true)?
A database might return already-parsed JSON that you want to index. Stringifying the "batch" data before indexing seems like a waste.
Search for "a phrase" bounded by inverted commas. Could possibly be implemented using the magic of ngrams.
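The ngram idea can be sketched by indexing word bigrams alongside single tokens, so a quoted two-word phrase matches exactly. Purely illustrative, not search-index's implementation:

```javascript
// Sketch: emit word bigrams for a text so a quoted two-word phrase can
// be looked up as a single index token.
function bigrams(text) {
  var words = text.toLowerCase().split(/\s+/).filter(Boolean);
  var out = [];
  for (var i = 0; i < words.length - 1; i++) {
    out.push(words[i] + ' ' + words[i + 1]);
  }
  return out;
}
```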