fergiemcdowall / search-index
A persistent, network resilient, full text search library for the browser and Node.js
License: MIT License
I know the example in the documentation uses a different notation (['a, b']), but this is the only way multiple-word search works for me:
var request = {
  "query": {
    "slug": ['a', 'b']
  },
  ...
};
Everything works fine if there are results, but when no results are found a "Cannot get length property of null" error is thrown by:
if (RIKeySet.length == 1) seekCutOff = (q.pageSize + q.offset);
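A null check before reading .length would avoid the crash. A minimal sketch, assuming RIKeySet can be null when nothing matches (computeSeekCutOff is a hypothetical wrapper, not the library's actual function):

```javascript
// Hypothetical sketch: guard against a null/undefined key set before
// reading .length, so an empty result set no longer throws.
function computeSeekCutOff(RIKeySet, q) {
  if (!RIKeySet || RIKeySet.length === 0) {
    return 0; // no hits: nothing to seek past
  }
  var seekCutOff = -1;
  if (RIKeySet.length === 1) seekCutOff = (q.pageSize + q.offset);
  return seekCutOff;
}
```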
It should be possible to make the index much smaller. Try to do this.
If you define an indexPath in the options, the empty function removes the 'si' folder and recreates a new db named 'si'. The database defined in the options is never emptied.
Faceted search and longer search strings are very fast.
However, single-word queries with a large recall can be slower. On the test dataset (Reuters), on low-end systems, a search for 'usa' takes around 250-300ms to return 12,500 docs.
This could be sped up by iterating through all docvector tokens in the index and caching search results for them.
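The caching idea above can be sketched as a simple memoising wrapper around a per-token lookup. This is illustrative only, not search-index's internals; lookupFn stands in for whatever expensive index scan the library performs:

```javascript
// Illustrative sketch (not the library's API): memoise per-token result
// sets so hot single-word queries skip the index scan on repeat lookups.
function makeTokenCache(lookupFn) {
  var cache = {};
  return function (token) {
    if (!(token in cache)) cache[token] = lookupFn(token);
    return cache[token];
  };
}
```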
Hi,
I have a small snippet that adds all files and subfolder files to the search-index using the following lines:
var si = require('search-index')({ indexPath: 'index.gz' });
...
for (var i = 0; i < files.length; i++) {
  var f = files[i]; // f = file path
  debug_si('Add ' + f + '.');
  var batchName = 'sona';
  var filters = ['path'];
  var data = {};
  data[f] = { 'path': f };
  si.add({ 'batchName': batchName, 'filters': filters }, data, function (err) {
    if (err) {
      debug_si('Error adding ' + f + '.');
      callback(err);
    }
  });
}
After the code runs through, a lot of information debug logs are printed:
....
....
[information] "reinserting tf sets"
[information] "sorting tf sets"
(the two lines above repeat many times)
[information] "[success] incremental calibration complete"
[success] "indexed batch: [object Object]"
But finally I got an exception:
D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\search-index.js:55
callback(msg);
^
TypeError: undefined is not a function
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\search-index.js:55:5
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\indexing\indexer.js:233:11
at D:\workspace_js\node-track-file-changes\node_modules\search-index\lib\indexing\calibrater.js:43:9
at D:\workspace_js\node-track-file-changes\node_modules\search-index\node_modules\level\node_modules\level-packager\node_modules\levelup\lib\levelup.js:351:9
I don't know why I get this error. Another question: what do 'batchName' and 'filters' actually do?
Maybe someone can help me.
When simply doing var si = require('search-index');
the following error is thrown:
module.js:340
throw err;
^
Error: Cannot find module 'fstream'
at Function.Module._resolveFilename (module.js:338:15)
at Function.Module._load (module.js:280:25)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object.<anonymous> (D:\programming\node_modules\search-index\lib\indexing\replicator.js:3:11)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
fstream is referenced in ./lib/indexing/replicator.js but not listed in package.json.
Discussion?
As search-index is upgraded it may not be compatible with existing index schemas. Perhaps it's time to store a 'semver' flag in the index itself, check it against the version of search-index in use, and [foobar] to the user if required.
If agreed, I'll try to submit an update when I get a moment.
Hello,
Thank you for the search-index module. I have a question: in many cases I don't want to see the search-index logs. How can I tell search-index not to log information?
Frank
It should be possible to make totally distinct search engine objects.
At the moment, instantiating two or more search-indexes in the same program is problematic.
Do instantiation like this:
var SearchIndex = require("search-index");
var si1 = new SearchIndex(options1);
var si2 = new SearchIndex(options2);
Checking for errors is a bit of a hassle, and the callback signature isn't exactly obvious at the moment. Using things like the async module is a pain, as you have to write wrappers for every call.
I'm happy to drop in a pull request for this.
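What is being asked for is the standard Node error-first callback convention. A sketch of the shape (the search function and result fields here are assumptions for illustration, not search-index's actual API):

```javascript
// Sketch of the Node-style error-first signature being requested:
// callback(err, result), with err === null on success. With this shape,
// modules like async can consume the calls without wrapper functions.
function search(query, callback) {
  if (typeof query !== 'object' || query === null) {
    return callback(new Error('query must be an object'));
  }
  // hypothetical result shape
  callback(null, { totalHits: 0, hits: [] });
}
```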
There is no longer a pretty indexing log. Winston is not playing nice for some reason.
It should be possible to filter on a range of values, for instance time intervals or lat/lon.
At the moment, malformed JSON crashes the system. Make it so that malformed JSON returns a pretty error while keeping the system up.
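A minimal sketch of the guard being requested, assuming the input arrives as a JSON string (safeParse is a hypothetical helper, not part of search-index):

```javascript
// Sketch: validate incoming JSON text and hand back an error instead of
// letting JSON.parse throw and take the process down.
function safeParse(text, callback) {
  var doc;
  try {
    doc = JSON.parse(text);
  } catch (e) {
    return callback(new Error('malformed JSON: ' + e.message));
  }
  return callback(null, doc);
}
```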
Make the mapreduce more powerful by using multicore functionality where available
https://nodejs.org/docs/v0.12.0/api/cluster.html
http://blog.carbonfive.com/2014/02/28/taking-advantage-of-multi-processor-environments-in-node-js/
Users can query on one field only. For example, only return docs with the term 'banana' in the title field
Build up a new test stack with karma and jasmine
Concentrate more on logic in search index, and compileability/correctness in Forage. Take web service testing out of Forage.
Some support for English language stemming.
"buying" should give hits for "buy", "buys", "buyer", etc
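Real English stemming would normally use something like the Porter algorithm (available in Node via NLP libraries), but the intent can be illustrated with a naive suffix-stripping sketch; naiveStem is purely a toy for this example:

```javascript
// Naive illustration only: a real implementation would use the Porter
// stemmer. This strips a few common suffixes so that "buying", "buys"
// and "buyer" all reduce to the same stem as "buy".
function naiveStem(word) {
  return word
    .toLowerCase()
    .replace(/(ing|ers|er|s)$/, '');
}
```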
I have about 52k documents that have been created by reading from my sqlite database.
When I loop through the 52k docs and do si.add, it takes forever. The process slows down at around 200 and then indexes really slowly. Is this expected, or am I missing something?
*Update: I had about 52k documents, not 900.
I haven't checked the code thoroughly, but since search-index uses levelup, a streaming API shouldn't be impossible?
Indexing from an fs or http stream makes a lot of sense to me.
I get this error:
/home/fatih/test/node_modules/search-index/lib/mapreduce/searcher.js:42
totalHits = intersection.length;
^
TypeError: Cannot read property 'length' of undefined
at /home/fatih/test/node_modules/search-index/lib/mapreduce/searcher.js:42:31
at /home/fatih/test/node_modules/search-index/node_modules/level-multiply/level-multiply.js:17:15
at proxy (/home/fatih/test/node_modules/search-index/node_modules/level-multiply/node_modules/after/lib/after.js:22:39)
at /home/fatih/test/node_modules/search-index/node_modules/level-multiply/level-multiply.js:29:13
at dispatchError (/home/fatih/test/node_modules/search-index/node_modules/level/node_modules/level-packager/node_modules/levelup/lib/util.js:131:7)
at /home/fatih/test/node_modules/search-index/node_modules/level/node_modules/level-packager/node_modules/levelup/lib/levelup.js:197:14
My code is this:
var si = require('search-index');
var colors = require('colors');
var fs = require('fs');
var data = JSON.parse(fs.readFileSync('node_modules/search-index/test/testdata/reuters-021.json'));
si.add(data, 'reuters-021.json', [], function (indexingMsg) {
  console.log(indexingMsg);
});
console.log("Search data *".underline.red);
si.search({
  'query': {
    '*': '*'
  }
}, function (searchResults) {
  console.log(searchResults.green);
});
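A likely cause of the undefined-intersection crash above is ordering: si.search() fires before si.add() has finished indexing. The principle can be sketched with stubbed add/search functions (hypothetical stand-ins whose callbacks fire synchronously, not search-index's real API): the query belongs inside the add callback.

```javascript
// Stubbed stand-ins for illustration: record the call order to show
// that search only runs once add's callback has fired.
var order = [];
function add(data, cb) { order.push('add'); cb(null); }
function search(q, cb) { order.push('search'); cb(null, { hits: [] }); }

add({ doc1: { title: 'reuters' } }, function (err) {
  if (err) throw err;
  search({ query: { '*': '*' } }, function (err, results) {
    if (err) throw err;
    // results are only used once indexing has completed
  });
});
```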
To be able to set a time range (from time/date to time/date) as a filter/navigator would be great! This is the upstream version of this issue, over at Norch:
fergiemcdowall/norch#65
Following the principle of Occam's razor, strip out the bloom filter functionality; it's not really needed for the direction search-index is going in.
Should be able to index and retrieve documents from one or more remote indexes. Support sharding
The compiler complains that the file stdint.h cannot be found. It is available only in recent MSVC versions. One workaround was suggested on Stack Overflow: downloading the MSVC-compatible version of the file from msinttypes.
Would you accept a patch fixing the build on older versions of MSVC, please? I'd detect such versions and include the file from msinttypes, stored under a different name than stdint.h.
Does search-index use leveldb to store the index? If so, where are the leveldb files stored, and can I specify a location when initializing search-index?
Another question: if I use Chinese characters, should I segment the sentence with spaces before inserting documents into the index?
Just like this:
si.add({'doc1': {'title': "中文 字体"}}, batchName, filters, function (msg) {
  res.send(msg);
});
And use the search API like this:
si.search({"query": {"*": ["中文 字体"]}}, function (msg) {
  res.send(msg);
});
Is this right?
It would be helpful if there were a configuration option that allows a consumer to set the location of the index.
In reality, indexes are big and should be housed somewhere other than with the application server code.
Is this something I can add as a pull request?
At the moment you can facet and filter on single values, but not value 'buckets' or ranges. This functionality was present in earlier builds, but has fallen out of the most recent build because of lazy documentation (my bad) and gaps in the test coverage (also my bad).
For newcomers wanting to try search-index as an indexing solution, the documentation is a bit slim and short of examples. For example, what are "facets"? The snippets in the documentation are anything but clear on their purpose and how to use them.
Could there be, at least, a full working example of the engine? And perhaps more than
Q: What is a facet?
R: Allows faceted navigation.
for the different options?
Also, what is one expected to get when using teaser? "Creates a field that shows where the search terms exist in the given field." Can an example result be given?
If no ID is specified, autogenerate one, preferably from a timestamp.
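A minimal sketch of timestamp-based ID generation. The counter suffix is an assumption of mine (to keep IDs unique within one millisecond), not a scheme search-index defines:

```javascript
// Sketch: derive a document ID from a timestamp when none is supplied.
// The counter suffix (hypothetical) keeps IDs unique when several docs
// arrive within the same millisecond.
var idCounter = 0;
function autoId(existingId) {
  if (existingId !== undefined && existingId !== null) return existingId;
  return Date.now() + '-' + (idCounter++);
}
```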
Hello There,
I might not have gotten to grips with the internals of the module, but my initial attempt was to create one node script which indexed some data, and another which provided a query HTTP API via express.
It seems that when one process is started there is an IO lock which does not allow the other to read information. I understand that this is locking to ensure the validity of the index.
OpenError: IO error: lock si/LOCK: Resource temporarily unavailable
Is there any way that the above scenario could be tackled?
Thanks,
Fotis
Line 2 in ./lib/search-index.js
colors = require('colors'),
A demo of a search with search-index that runs nowhere else but the browser.
Option to recalibrate index as documents are added
When I use search-index, it asks for fstream, but fstream is not included in the dependency list.
matcher (and the whole project, really) seems very handy, but it feels like there's a gap between it and getting search results.
In the use case of a search box with typeahead support, matcher will get you the suggestions to present to the user, but once the user selects one of the suggestions, to retrieve the actual relevant documents you then have to call search with the selection.
It would be great if matcher gave you back not only matching field strings but also the docID, so you could just call get instead of search. Would the right approach for this be for indexDoc in indexer.js to stuff the docID in with the reverseIndex key?
exports.index = function(batchString, batchName, filters, callback) {
e.g. si.add is mentioned but no longer exists.
I don't know if I have forgotten something, but I indexed a few documents and when I tried to make a query, the query seems to be case sensitive. If I look for "this" and a field contains "This is an example", search-index does not return it.
Is this the expected behaviour? Should I index all documents in lowercase?
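The usual fix is to normalise case at both index time and query time. A sketch of the idea; the tokenizer here is a stand-in for illustration, not search-index's own:

```javascript
// Sketch: lowercase both the indexed text and the query term so "this"
// matches "This". The tokenizer is a hypothetical stand-in.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}
function matches(fieldText, term) {
  return tokenize(fieldText).indexOf(term.toLowerCase()) !== -1;
}
```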
Facets should be sortable (alphabetic, numeric, magnitude). There should also be a query vocabulary to express this.
This is a way of taking field length into account when talking about term frequency. Shorter fields are typically more meaningful, therefore terms appearing in shorter fields are given a higher value. Should be calculated on indexing.
Users can return all docs in the index by using, say, an asterisk ('*') as the query term.
Line 153:
if (docFreqs[leastFrequentKey] == 0) {
  sendResultSet();
}
The code above doesn't return, so it goes on to scan the index and triggers the callback a second time. Pull request coming up.
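The fix being described can be sketched as follows; runQuery is a simplified stand-in for the surrounding searcher logic, not the actual function:

```javascript
// Sketch of the fix: returning after sendResultSet() prevents the index
// scan from firing the callback a second time.
function runQuery(docFreqs, leastFrequentKey, sendResultSet, scanIndex) {
  if (docFreqs[leastFrequentKey] === 0) {
    return sendResultSet(); // early exit: empty result set, done
  }
  scanIndex();
}
```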
Update the term dictionary as individual documents are added and taken away from the index, but make it optional
Would you consider having search-index use the levelup module directly instead of via level? Along with some way to pass configuration options to the constructor, this would allow use of level-js, so search-index would be usable in the browser as well as in Node.js. I'd be willing to work on a patch for this.
I believe the correct key for page size is pageSize, but readme.MD shows pagesize.
If pagesize is used, totalHits shows the correct value but the hits array is empty.
Cache queries, so that results can be returned without performing an actual search.
Must be kept off of a multinode installation since old caches cannot be removed if the keys are hashed, and it is faster if it is local.
Why is the err param from the si.del callback true when all is OK?
In deleteDoc, wouldn't it be better to return callback(null, true) instead of return callback(true)?
A database might return already-parsed JSON that you want to index. Stringifying the "batch" data before indexing seems like a waste.
Search for "a phrase" bounded by inverted commas. Could possibly be implemented using the magic of ngrams.
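The ngram idea can be sketched by indexing word bigrams alongside single tokens, so a quoted two-word phrase matches exactly. Purely illustrative, not search-index's implementation:

```javascript
// Sketch: emit word bigrams for a text so a quoted two-word phrase can
// be looked up as a single index token.
function bigrams(text) {
  var words = text.toLowerCase().split(/\s+/).filter(Boolean);
  var out = [];
  for (var i = 0; i < words.length - 1; i++) {
    out.push(words[i] + ' ' + words[i + 1]);
  }
  return out;
}
```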