rnewson / couchdb-lucene


Enables full-text searching of CouchDB documents using Lucene

License: Apache License 2.0

Python 1.32% Shell 2.22% Java 96.32% Batchfile 0.14%


couchdb-lucene's Issues

jar file in downloads section is invalid

CouchDB throws:
Invalid or corrupt jarfile /home/jri/couchdb-lucene-0.4-jar-with-dependencies.jar
[error] [<0.49.0>] OS Process died with status: 1

An integrity check confirms this:
wget http://cloud.github.com/downloads/rnewson/couchdb-lucene/couchdb-lucene-0.4-jar-with-dependencies.jar.gz
gunzip couchdb-lucene-0.4-jar-with-dependencies.jar.gz
unzip -t couchdb-lucene-0.4-jar-with-dependencies.jar

results in
Archive: couchdb-lucene-0.4-jar-with-dependencies.jar
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
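
The same check can be scripted before CouchDB is pointed at the jar. This is a minimal sketch using Python's standard `zipfile` module; the helper name `looks_like_jar` is made up for illustration and is not part of couchdb-lucene.

```python
import zipfile


def looks_like_jar(path):
    """Return True if `path` is a readable zip archive with no corrupt members."""
    if not zipfile.is_zipfile(path):
        # No end-of-central-directory record: the failure `unzip -t` reports.
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None
```

A jar is a zip archive, so a download that fails this check is truncated, still compressed, or otherwise damaged.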

"method" -> "verb" rename in couchdb r893364 breaks c-l

Line 82 of the python external has "method" hardcoded. We should probably be testing for "verb" in req and use that if it's there, otherwise fall back to "method" for compatibility with older versions.

http://mail-archives.apache.org/mod_mbox/couchdb-commits/200912.mbox/%[email protected]%3E
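
A minimal sketch of the fallback described above, assuming `req` is the request dict that CouchDB passes to the external process. The function name `request_method` is hypothetical. It avoids the conditional-expression syntax so it should also parse on older Python 2 interpreters.

```python
def request_method(req):
    """Return the HTTP method from a CouchDB external request dict.

    Assumption: newer CouchDB sends "verb", older versions send "method".
    """
    if "verb" in req:
        return req["verb"]
    # Fall back to the older key name for compatibility.
    return req["method"]
```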

querying utf-8 documents fails

When I enter documents containing utf8-chars like öäüß in a couchdb via Futon and query them afterwards with

curl http://127.0.0.1:5984/notes_development/_fti/lucene/by_title?q=pop*

I get the following error:

{"error":"ucs","reason":"{bad_utf8_character_code}"}

(running under OS X 10.5)
~ > java -version
java version "1.5.0_16"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_16-b06-284)
Java HotSpot(TM) Client VM (build 1.5.0_16-133, mixed mode, sharing)

-- Frank

Boosting Ranks on certain search items

How do I boost the ranks for certain search fields? I know this is in lucene somewhere, but how do I access it in cdb-l?

Cheers
Rohit

Make couchdb.lucene.operator a query-parameter

It would be nice to have the default lucene-operator configurable per query or at least per design-document. Imagine you have two applications served from the same couchdb instance: in one you want OR as the default, in the other AND. Or maybe you have different query types...

NumberRangeQuery too eagerly created in couchdb-lucene's CustomQueryParser

I have the following problem: I have a field of type string that I want to make a query on, say field:[00.0 TO 00.17]. Unfortunately, this doesn't work correctly, since CustomQueryParser assumes everything that looks like a Number will have been indexed as a number, which is too strong an assumption IMHO.

BTW, field:[-00.0 TO 00.17] works, because CustomQueryParser only recognizes nonnegative numbers (but that's a different issue).

"foo is not a valid view" when database contains no documents

Trying to search a database that contains no documents (besides the couchdb-lucene design document) brings up the error "foo is not a valid view." This is misleading, as the view works properly when there are documents in the database.

latest couchdb-lucene not indexing

I have successfully built and run couchdb-lucene 0.5 (github HEAD), with couchdb 0.10.1 (via MacPorts) on OS X following the instructions.

But I don't get any indication that it is trying to index my databases. The JVM daemon starts, but the indexes folder is empty, couchdb.log shows no attempt from couchdb-lucene to do anything with the DBs, and no couchdb-lucene.log is created.

Even if I have more work to do to set up the index functions, I'd still expect to see some activity from couchdb-lucene to query couchdb to get a list of databases, design docs, _changes, etc. Right?

I feel like I'm missing something simple, but lots of re-reading the docs and tweaking configs hasn't helped.

Occasional 500 errors

Occasionally the server will return:
HTTP/1.1 500 Internal Server Error
Server: CouchDB/0.10.0 (Erlang OTP/R12B)
Date: Wed, 23 Dec 2009 04:08:46 GMT
Content-Type: text/plain
Content-Length: 603

Traceback (most recent call last):
  File "/usr/lib/couchdb/couchdb-lucene/couchdb-external-hook.py", line 41, in main
    resp = respond(res, req, opts.local_host, opts.local_port)
  File "/usr/lib/couchdb/couchdb-lucene/couchdb-external-hook.py", line 86, in respond
    resp = res.getresponse()
  File "/usr/lib/python2.4/httplib.py", line 872, in getresponse
    response.begin()
  File "/usr/lib/python2.4/httplib.py", line 336, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.4/httplib.py", line 300, in _read_status
    raise BadStatusLine(line)
BadStatusLine

Using the most recent version of couchdb-lucene (at time of posting) and couchdb 0.10.0
The majority of the time the response is fine.
There is no load on the server.
bin/run does not report any errors.

FEATURE REQUEST: Indexing views

invalid syntax on line 82 of couchdb-external-hook.py

I'm getting the following error with the latest version of couchdb-lucene:

SyntaxError: invalid syntax
File "/usr/lib/couchdb/couchdb-lucene/couchdb-lucene-0.5-0.2/tools/couchdb-external-hook.py", line 82
method = req["method"] if "method" in req else req["verb"]

$ python -V
Python 2.4.3

Changing the line to this fixes the error:

method = req["verb"]
if "method" in req:
    method = req["method"]

dave

Searching without using an analyzer not possible

If I index with "index":"not_analyzed" I get into trouble when trying to search for these terms, as there is no way to bypass the analyzer when issuing search queries. The possibility to use a URI parameter like "analyzer=none" or "analyzer=null" etc. would be great. Or am I missing something?

Integrating local lucene to c-l

I was just wondering if it is possible to integrate local lucene into couchdb-lucene. I have not looked deeper into l-l, but it sounds very interesting. local lucene is available here: http://sourceforge.net/projects/locallucene/

Maybe I'll try to integrate local lucene to c-l myself, when I find the mood to dig into Java ;)

Include "_design/" prefix in design doc pathname parser

Currently, the "_design/" component isn't considered when parsing an _fti request.

This doesn't conform to standard couchdb URL conventions and seems to create a problem with multiple index functions in multiple design documents.
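
One way to accept the prefix without breaking existing URLs is to treat the "_design" segment as optional. The sketch below is hypothetical: the function name, the list-of-segments input, and the exact path layout are assumptions, not the project's actual parser.

```python
def parse_fti_path(path):
    """Split an _fti request path into (design_doc_id, index_name).

    Assumed layout: ["db", "_fti", "_design", "ddoc", "index"], where the
    "_design" segment is optional for backwards compatibility.
    """
    segments = [s for s in path if s]
    if len(segments) < 4 or segments[1] != "_fti":
        raise ValueError("not an _fti request: %r" % (path,))
    rest = segments[2:]
    if rest[0] == "_design":
        # Skip the standard CouchDB prefix when it is present.
        rest = rest[1:]
    if len(rest) != 2:
        raise ValueError("expected design doc and index name: %r" % (path,))
    ddoc, index = rest
    return "_design/" + ddoc, index
```

Returning the full "_design/..." id keeps index functions from different design documents apart.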

Indexer silently dies when FSDirectory cannot be created

Exception looks like:

java.io.IOException: Cannot create directory: /some/inaccessible/dir/lucene
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:175)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:139)
        at com.github.rnewson.couchdb.lucene.Index.main(Index.java:324)
        at com.github.rnewson.couchdb.lucene.Main.main(Main.java:32)

See [skarab/couchdb-lucene@e0ec731] for a small patch to catch the unhandled IOException.

couchdb-lucene 0.5 does not properly handle databases with '/' in their names

The problem at least exists when the index is queried, but I think the problem persists throughout, including places where c-l queries CouchDB.
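
CouchDB expects a "/" inside a database name to be percent-encoded as %2F in the URL, so any code that builds request paths has to encode each segment separately. This is a minimal sketch on the client side. The helper `fti_url` is made up, and the /db/_fti/ddoc/index layout is taken from the query URLs quoted elsewhere on this page.

```python
from urllib.parse import quote


def fti_url(base, db, ddoc, index):
    """Build an _fti URL, percent-encoding every path segment."""
    # safe="" makes quote() encode "/" as %2F instead of leaving it alone.
    parts = [quote(p, safe="") for p in (db, "_fti", ddoc, index)]
    return base.rstrip("/") + "/" + "/".join(parts)
```

For a database named team/notes this yields a path starting with /team%2Fnotes/_fti/.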

Build fails on Osx 1.5.7

Hi,

building the project with 'mvn' on Osx (Java 1.6 or 1.5) results in some failing tests (see gist below).
-- Frank

http://gist.github.com/140014

Problem with too many documents and 0.4 version

I just upgraded to 0.4. When I launch couchdb I get this from the indexer.
With other, smaller databases I have no problem.
I've tried deleting the index and rebuilding it, checked the permissions of the files and all that, and the problem is still there.

[info] [<0.59.0>] 127.0.0.1 - - 'GET' /fisica-nist3/_all_docs_by_seq?startkey=53250&limit=250&include_docs=true 200
[couchdb-lucene] WARN Exception while updating index.
java.io.FileNotFoundException: /usr/local/bin/lucene/_b.fnm (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:58)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:341)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:236)
at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:915)
at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4339)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3579)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3450)
at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1638)
at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2748)
at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2683)
at com.github.rnewson.couchdb.lucene.Index$Indexer.updateIndex(Index.java:239)
at com.github.rnewson.couchdb.lucene.Index$Indexer.run(Index.java:96)
at java.lang.Thread.run(Thread.java:636)
[couchdb-lucene] INFO indexer stopped.
[info] [<0.59.0>] 127.0.0.1 - - 'GET' /fisica-nist3/_all_docs_by_seq?startkey=53500&limit=250&include_docs=true 200

Support for couchdb _list functions

The resulting documents from a lucene query should be formattable using CouchDB's _list API.

For instance with a URI like:

/database/_fti/lucene/_list/listname/lucene_idx_name?q=

Retrieve index statistics

I'm finding that trying to figure out what's in the index is a PITA. When debugging whether view indexing ran or whether my function is working, it'd be nice to have access to a JSON blob that showed which 'sub-indexes' exist and the list of fields in each index, or similar, and perhaps things like the term count per field, similar to what Luke shows in its overview.

That is all.

[Minor] Header labels could be properly cased

In the response headers, couchdb uses 'Content-Type', and couchdb-lucene uses 'content-type'.
According to RFC 2616, section 4.2, header field names are case-insensitive, but the difference from couchdb tripped up the library I'm using.
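
On the client side, a case-insensitive lookup sidesteps the difference. This is a minimal sketch, assuming the response headers are available as a plain dict; `get_header` is a hypothetical helper.

```python
def get_header(headers, name):
    """Look up a header by name, ignoring case; return None if absent."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None
```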

Build couchdb-lucene: jcip-annotations error

When building couchdb-lucene with maven2 on Mac OS X 10.5.8 I got the following error:

Failure executing javac, but could not parse the error: An exception has occurred in the compiler (1.5.0_22). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport) after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report. Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: file net/jcip/annotations/GuardedBy.class not found

This is caused by a known issue in HttpClient that is only fixed in the 4.1alpha version.
https://issues.apache.org/jira/browse/HTTPCLIENT-866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Fix is to edit the pom.xml of couchdb-lucene and add jcip annotations to the dependencies by putting this in:


    
    <dependency>
        <groupId>net.jcip</groupId>
        <artifactId>jcip-annotations</artifactId>
        <version>1.0</version>
    </dependency>

After I did this it compiles fine. Probably once the new (non alpha) version of HttpClient is released this workaround won't be needed anymore.

TypeError when creating lucene document

Hi,

I'm using couchdb-lucene 0.5 as of today (01-07-2010) and couchdb 0.10.

When couchdb-lucene is indexing my documents, during the first access of the view I'm getting a TypeError exception that's preventing the document from being indexed:

2010-01-07 23:45:08,859 WARN couchdb.lucene.ViewIndexer.localhost/5984/knownet/bag/knsearch 16bb3f66e8d4c88d11b4e5c092ba38ab caused TypeError: Cannot find default value for object. (unnamed script#37)

I modified the DocumentConverterTest.java class to use my documents and my functions and the exception is still being thrown:

marcosvm@pepita:~/Servers/couchdb-lucene-0.5-SNAPSHOT $ jruby document_converter_test.rb
Loaded suite document_converter_test
Started
E
Finished in 0.258 seconds.

Error:
test_document_conversion(DocumentConverterTest):
NativeException: org.mozilla.javascript.EcmaError: TypeError: Cannot find default value for object. (single#49)
org/mozilla/javascript/ScriptRuntime.java:3654:in `constructError'
org/mozilla/javascript/ScriptRuntime.java:3632:in `constructError'
org/mozilla/javascript/ScriptRuntime.java:3660:in `typeError'
org/mozilla/javascript/ScriptRuntime.java:3672:in `typeError1'
org/mozilla/javascript/ScriptableObject.java:781:in `getDefaultValue'
org/mozilla/javascript/ScriptableObject.java:700:in `getDefaultValue'
org/mozilla/javascript/ScriptRuntime.java:724:in `toString'
org/mozilla/javascript/ScriptRuntime.java:3741:in `notFunctionError'
org/mozilla/javascript/ScriptRuntime.java:2247:in `getPropFunctionAndThisHelper'
org/mozilla/javascript/ScriptRuntime.java:2214:in `getPropFunctionAndThis'
org/mozilla/javascript/gen/single:49:in `_c0'
org/mozilla/javascript/gen/single:-1:in `call'
org/mozilla/javascript/ContextFactory.java:398:in `doTopCall'
org/mozilla/javascript/ScriptRuntime.java:3065:in `doTopCall'
org/mozilla/javascript/gen/single:-1:in `call'
com/github/rnewson/couchdb/lucene/DocumentConverter.java:59:in `convert'
document_converter_test.rb:23:in `test_document_conversion'

1 tests, 0 assertions, 0 failures, 1 errors

I'm not sure how to provide this default value required by Rhino or if it's an actual bug during the document conversion.

I put a copy of the function and one document here: http://gist.github.com/271912

Any help would be appreciated.

Thanks in advance,
Marcos

Handling list values In fulltext search couchdb-lucene is broken.

Ref: see the "Index Everything" example at http://github.com/rnewson/couchdb-lucene/blob/master/README.md

Ideally, if the value of a key is an array, its contents should be indexed as a joined string.
Currently, however, the key is ignored and additional keys are created from the positions of the array items, which is undesirable.

For example, let's say we have a JSON data structure (DS) like this:
{'cars':['alto','mercedes','mahindra']}
The keys generated while indexing this DS are 0, 1 and 2, with the values alto, mercedes and mahindra respectively, while the "cars" key is ignored, which is clearly not what we want.

The expected behaviour is to generate a single key "cars" with the array values joined (comma-delimited) as "alto,mercedes,mahindra" and then index that.
I hope it's clear!
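
The expected behaviour can be sketched as follows. This is only an illustration of the joining rule for top-level keys. The function name `fields_for` is hypothetical, and the real indexing happens inside couchdb-lucene's JavaScript index functions, not in Python.

```python
def fields_for(doc, delimiter=","):
    """Map each top-level key of `doc` to one indexable string value."""
    fields = {}
    for key, value in doc.items():
        if isinstance(value, list):
            # Join array items under the original key instead of
            # emitting one field per array position.
            fields[key] = delimiter.join(str(item) for item in value)
        else:
            fields[key] = str(value)
    return fields
```

With the document above, this produces a single "cars" field holding "alto,mercedes,mahindra".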

thanks

JSONP callback should not be returned in quotes

A Fulltext query with the "callback" parameter returns a (JSON-encoded ?) double-quoted string. For callbacks to work, this must be a method invocation in JavaScript, without quotes. Just as in normal CouchDB views with the "callback" Parameter.

So instead of
"cb({"q":"default: ... "})"

return
cb({"q":"default:..."})
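
The difference amounts to wrapping the already-encoded JSON in a call expression rather than encoding the wrapped text a second time. A minimal sketch, with a made-up helper name `jsonp_body`:

```python
import json


def jsonp_body(payload, callback=None):
    """Encode `payload` as JSON, wrapped in a callback call if one is given."""
    body = json.dumps(payload)
    if callback:
        # Plain concatenation: the result is JavaScript, not a JSON string,
        # so it must not be passed through the JSON encoder again.
        return callback + "(" + body + ")"
    return body
```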

JsonToRhinoConverter issue when null value specified in doc

I'm not too familiar with how the JSON -> Rhino conversion happens, but it seems that the wrong thing is done when a javascript property exists but its value is null. It seems that in JsonToRhinoConverter, ScriptableObject.putProperty will place a converted JSONNull value when it should probably be deleting it (or maybe handling it differently). The current behavior results in unexpected strings such as "com.github.rnewson.couchdb.rhino.JsonToRhinoConverter..." to end up indexed instead of the fields we want.

I apologize if this is unclear, but I did put a workaround in for now that results in the behavior I expect, though I'm pretty sure I'm not doing this the right way :)

http://gist.github.com/278496

java.lang.ClassCastException: JSON keys must be strings.

I recently started getting this error which is preventing new document changes from being indexed. Any ideas??

Oh, and this is with couchdb 0.9.1 and couchdb-lucene 0.4.

[couchdb-lucene] ERROR Error updating index.
java.lang.ClassCastException: JSON keys must be strings.
    at net.sf.json.JSONObject._fromJSONObject(JSONObject.java:1067)
    at net.sf.json.JSONObject.fromObject(JSONObject.java:177)
    at net.sf.json.JSONSerializer.toJSON(JSONSerializer.java:108)
    at net.sf.json.JSONArray._processValue(JSONArray.java:2535)
    at net.sf.json.JSONArray.processValue(JSONArray.java:2593)
    at net.sf.json.JSONArray.addValue(JSONArray.java:2580)
    at net.sf.json.JSONArray.element(JSONArray.java:1753)
    at net.sf.json.JSONArray.fromObject(JSONArray.java:183)
    at net.sf.json.JSONSerializer.toJSON(JSONSerializer.java:113)
    at net.sf.json.JSONObject._processValue(JSONObject.java:2759)
    at net.sf.json.JSONObject.processValue(JSONObject.java:2852)
    at net.sf.json.JSONObject.element(JSONObject.java:1891)
    at net.sf.json.JSONObject._fromJSONTokener(JSONObject.java:1175)
    at net.sf.json.JSONObject.fromObject(JSONObject.java:181)
    at net.sf.json.util.JSONTokener.nextValue(JSONTokener.java:370)
    at net.sf.json.JSONArray._fromJSONTokener(JSONArray.java:1160)
    at net.sf.json.JSONArray.fromObject(JSONArray.java:149)
    at net.sf.json.util.JSONTokener.nextValue(JSONTokener.java:373)
    at net.sf.json.JSONObject._fromJSONTokener(JSONObject.java:1147)
    at net.sf.json.JSONObject._fromString(JSONObject.java:1337)
    at net.sf.json.JSONObject.fromObject(JSONObject.java:187)
    at net.sf.json.JSONObject.fromObject(JSONObject.java:156)
    at com.github.rnewson.couchdb.lucene.Database.getAllDocsBySeq(Database.java:87)
    at com.github.rnewson.couchdb.lucene.Index$Indexer.updateDatabase(Index.java:262)
    at com.github.rnewson.couchdb.lucene.Index$Indexer.updateIndex(Index.java:199)
    at com.github.rnewson.couchdb.lucene.Index$Indexer.run(Index.java:95)
    at java.lang.Thread.run(Thread.java:619)

Bulk insert does not trigger index update

When inserting/creating a large number of new documents via CouchDB's bulk API, the Lucene index is not updated accordingly.

Crashes on OS X 10.6.1

I'm having a crashing issue with CouchDB 0.10.0 and couchdb-lucene 0.4 on OS X 10.6.1.

Here's my local.ini: http://gist.github.com/231189
Here's the couch.log output: http://gist.github.com/231188

I see something that might be the issue:

[Tue, 10 Nov 2009 19:24:53 GMT] [debug] [<0.5768.0>] OS Proc: Unknown info: {#Port<0.4098>,
{data,{eol,<<"Error occurred during initialization of VM">>}}}

[Tue, 10 Nov 2009 19:24:53 GMT] [debug] [<0.5768.0>] OS Proc: Unknown info: {#Port<0.4098>,
{data,{eol,<<"Unable to load native library: libjava.jnilib">>}}}

...but I can't tell how to go about fixing and/or troubleshooting this further.

Any help very much appreciated. Thank you!

crashes on Mac OS X 10.5.8

When I enable couch lucene in my local.ini file and restart couchdb, couchdb itself works fine but every few seconds the lucene indexer crashes. Configuration info and logs included below.

-----INFO-----
Mac OS X 10.5.8 on powerpc
Apache CouchDB 0.10.0a799093
compiled couchdb-lucene with standard mac os x java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)
-----T E S T S-----
Running com.github.rnewson.couchdb.lucene.LanguageIdentifierTest
Tests run: 12, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 46.392 sec
Running com.github.rnewson.couchdb.lucene.RhinoTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.149 sec
Running com.github.rnewson.couchdb.lucene.TikaTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.351 sec
Running org.apache.nutch.analysis.lang.LanguageIdentifierTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 18.844 sec
Results :
Tests run: 25, Failures: 0, Errors: 0, Skipped: 1

local.ini = http://gist.github.com/169027
couch.log = http://gist.github.com/169025
crash report = http://gist.github.com/169028

missing special field _db

I can't find the special _db field anywhere in the 0.5 code. Is it still there?

I want to add the database name of the source document when indexing. I can add an extra field for the database name to the documents, but that doesn't feel right from a programmer's point of view : )

Allow setting of default operator [enhancement]

Lucene allows you to change the default operator for queries from OR to AND. It would be nice if this option could be utilized when making queries through couchdb.

http://incubator.apache.org/lucene.net/docs/2.1/Lucene.Net.QueryParsers.QueryParser.SetDefaultOperator.html

Patch for 0.4 trunk support

Thanks for your work first of all !

Here is a small patch to make the 0.4x branch work with couchdb trunk, mostly due to the new "_changes" feature:
http://friendpaste.com/7933r3BlrJ4pJM3WBjanM6

here is a python test case for this patch :
http://friendpaste.com/6ChcO6yunH4jvi5c2sUfM

The patch passes this simple test case. I hope I haven't forgotten anything.

Regards,

xav

Support JSONP style callbacks [enhancement]

CouchDB added JSONP callback support in the latest version (0.10).

Adding this feature to couchdb-lucene entails modifying SearchRequest.java in the following way (untested):

SearchRequest constructor:

// Parse callback argument
this.callback = query.optString("callback");

execute method:
// Return the result callback-style if a callback was given.
// Test the length: != on strings compares references in Java, not contents.
if (callback.length() > 0) {
    result.put("json", callback + "(" + json + ")");
} else {
    result.put("json", json);
}

Losing checkpoints after restarting bin/run

...
2009-12-23 13:21:35,489 INFO [localhost/5984/cgm/couchapp/by_name] Committed checkpoint at update_seq 357773
(used Ctrl+C to stop process)
[root@localhost couchdb-lucene]# bin/run
2009-12-23 14:38:31,722 INFO [Main] Index output goes to: /usr/lib/couchdb/couchdb-lucene/indexes
2009-12-23 14:38:31,789 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985
2009-12-23 14:39:10,750 INFO [localhost/5984/cgm/couchapp/by_name] Starting.
2009-12-23 14:39:22,220 INFO [localhost/5984/cgm/couchapp/by_name] Committed checkpoint at update_seq 209712
2009-12-23 14:39:32,643 INFO [localhost/5984/cgm/couchapp/by_name] Committed checkpoint at update_seq 211543
...

Using the most recent version of couchdb-lucene at the time of this posting, and couchdb 0.10.0

0.5 compileFunction bug?

I'm trying to get c-l 0.5 working on couchdb 0.10.1 on fedora core 5.3. I have c-l successfully running on a different system running 0.9 and c-l 0.4. Perhaps I'm doing something very obvious wrong here, but I can't find the same issue anywhere else.

This is the trace I'm getting, with one of the example methods:

http://pastie.org/765409

It doesn't matter what I put in the index function for the _fti index, it keeps coming up with this error, even if I change the function to a mere 'return null;', it returns the same error. On 0.4, no problems whatsoever.

I'll downgrade to 0.4 for now, but I just wanted to check if perhaps anyone else has bumped into this issue with 0.5 as well.

Case sensitive problem

I have many documents like this:
{Doctype:UNITS} or
{Doctype:SOURCE}

I've looked in CouchDB and the document preserves the caps. I looked into the lucene index with Luke and the value still has the caps, but if I try to search with the caps like this:

http://localhost:5984/fisica-nist3/_fti/lucene/by_subject/?q=Doctype:UNITS

I get no documents, and in the rewritten query in the output I can see that Doctype preserves the caps but UNITS is lowercased:

{"q":"Doctype:units","etag":"122bb567222","view_sig":"76efdc8dfb9d98ed577f2b7640228de4","skip":0,"limit":25,"total_rows":0,"search_duration":1,"fetch_duration":0,"rows":[]}

Maybe something is changing it before the query is launched against lucene?

updating an index function

Robert,

Is there a way to update a fulltext function without losing the current index while the new one is building? I need to rebuild an index that originally took a couple of days to build and can't lose my search while it's rebuilding.

I've tried adding a new design doc and renaming it to the original once the index was complete, but couch won't let me change the id of the doc. The only other thing I could think of doing was to create the new design doc and change my code to use it once it's complete (I'm hoping to avoid this if possible).

Thanks,
Dave

Errors from adding a null field to an index are hard to diagnose.

With an indexing function such as:

function(doc) { var ret=new Document(); ret.add(doc.name); return ret }

If doc.name is ever null, indexing fails, causing couchdb-lucene to no longer recognize that view as valid, which is difficult to diagnose. Also, couchdb throws the error "first argument must be non-null." on startup.

Workaround:

function(doc) { var ret=new Document(); if (doc.name) { ret.add(doc.name); return ret } }

couchdb-lucene should ignore null fields rather than dying.

Query multiple terms via POST?

If I need to query multiple terms at the same time, one option is to concatenate them with "OR" and pass them as the "q" parameter in a GET request. This gives me all the hits I want, but the results lose track of which hit came from which term.

I am wondering if it's possible to submit multiple terms via POST (e.g. '{"queries": ["q1", "q2", ...]}') and get back a list of hits in the order of the input queries, just like POSTing '{"keys": ["key1", "key2", ...]}' to a couchdb view URL for multiple queries.
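
Until something like this exists server-side, the grouping can be done on the client by issuing one request per term. This is only a sketch: `search` stands for a caller-supplied function (hypothetical here) that sends one query to the _fti URL and returns its list of hits.

```python
def search_each(queries, search):
    """Run each query separately and keep the results in input order.

    `search` is a caller-supplied callable taking one query string and
    returning the list of hits for it.
    """
    return [{"q": q, "rows": search(q)} for q in queries]
```

The cost is one HTTP round trip per term, which is what a multi-query POST would avoid.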

Multiple Revisions included in Search

I see duplicate results in the search, because of earlier revisions of the document. Is there a way to have couchdb-lucene only index the latest revision?

Problems building v0.4

I'm trying to build from the v0.4 tag, and I'm getting the following from maven:

T E S T S

Running com.github.rnewson.couchdb.lucene.LanguageIdentifierTest
Tests run: 12, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 2.164 sec
Running com.github.rnewson.couchdb.lucene.RhinoTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.574 sec
Running com.github.rnewson.couchdb.lucene.IntegrationTest
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 52.468 sec <<< FAILURE!
Running com.github.rnewson.couchdb.lucene.LuceneTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.007 sec <<< FAILURE!
Running com.github.rnewson.couchdb.lucene.TikaTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.748 sec
Running org.apache.nutch.analysis.lang.LanguageIdentifierTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.507 sec

Results :

Tests in error:
longIndex(com.github.rnewson.couchdb.lucene.IntegrationTest)
index(com.github.rnewson.couchdb.lucene.IntegrationTest)
initializationError(com.github.rnewson.couchdb.lucene.LuceneTest)

Tests run: 28, Failures: 0, Errors: 3, Skipped: 1

If you want the surefire-reports files, I can email them to you...

Thanks,

Zach

last_modified values passed back as large integer in json

I'm using collectd's new curl_json plugin to pull couchdb stats and wanted to add some for couchdb-lucene. The plugin fails to parse the json returned by couchdb-lucene because the last_modified value is passed back as an integer that is too large. The actual failure is in yajl, the json parsing library used by the curl_json plugin.

The collectd config I was using:

<URL "http://localhost:5984/history/_fti">
  Instance "lucene"
  <Key "/doc_count">
    Type "gauge"
  </Key>
  <Key "/doc_del_count">
    Type "counter"
  </Key>
  <Key "/disk_size">
    Type "bytes"
  </Key>
</URL>

The error from collectd:

2009-10-06_18:31:41.38722 [2009-10-06 18:31:41] curl_json plugin: yajl_parse failed: parse error: integer overflow
2009-10-06_18:31:41.38723           roup","room"],"last_modified":1254853876000,"optimized":fals
2009-10-06_18:31:41.38724                      (right here) ------^

The data returned from the curl:
{"current":true,"disk_size":4939405,"doc_count":42658,"doc_del_count":2,"fields":["body","group","room"],"last_modified":1254853876000,"optimized":false}

I noticed that couchdb returns its instance_start_time statistic as a string in json instead of an integer and was wondering if the same could be done for couchdb-lucene. For now, I'm just using a proxy script to do that conversion for me.
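
The conversion done by such a proxy can be sketched in a few lines. This is not the script mentioned above, only an illustration of the idea; `stringify_last_modified` is a made-up name.

```python
import json


def stringify_last_modified(body):
    """Re-encode a stats response with last_modified as a JSON string."""
    stats = json.loads(body)
    if "last_modified" in stats:
        # Millisecond timestamps overflow 32-bit integer parsers,
        # so hand them over as strings instead.
        stats["last_modified"] = str(stats["last_modified"])
    return json.dumps(stats)
```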

Versions:

couchdb-lucene 0.4
couchdb 0.9.1
collectd 4.8.0
yajl trunk

Problem with couchdb-lucene 0.4 and couchdb 0.11.0b820580

I'm trying to reinstall my whole system. I got couchdb-lucene from git and couchdb from svn. I've made an index (exactly the same as the one in the documentation).
And I get:

[info] [<0.98.0>] 127.0.0.1 - - 'GET' /fisica-nist3/_all_docs_by_seq?startkey=0&limit=250&include_docs=true 404
[couchdb-lucene] WARN no rows found ({"error":"not_found","reason":"missing"}).

Probably something changed in the latest couchdb?

Ability for pausing the indexer

As already mentioned / discussed in IRC, it would be really cool to be able to pause the indexer. This would be useful when inserting or updating a huge number of documents, to increase (write) performance. I thought of something like: a) stop the c-l indexer, b) insert/update a few tens of thousands of docs, c) restart the indexer.

couchdb-lucene pegs CPU on Leopard 10.5.7

Followed instructions line by line. As soon as the java -index process begins it pegs the CPUs, even if there are no databases/documents present in CouchDB.

OS: Leopard 10.5.7
CouchDB: 0.9.1
Java: 1.5.0_19

Problem with couchdb-lucene 0.5 and couchdb 0.11.0b820580

I've tried to move to couchdb-lucene 0.5 and I get the following error. I have Rhino 1.7R1 and mozjs 1.9.0.

2009-10-02 09:08:18,511 [couchdb-lucene] INFO Indexing fisica-nist3/lucene/by_subject from scratch.
[info] [<0.98.0>] 127.0.0.1 - - 'GET' /fisica-nist3/_changes?since=0&limit=250&include_docs=true 200
2009-10-02 09:08:18,590 [couchdb-lucene] ERROR Error updating index.
java.lang.RuntimeException: Invalid object type: org.mozilla.javascript.NativeObject
at com.github.rnewson.couchdb.lucene.Rhino.map(Rhino.java:129)
at com.github.rnewson.couchdb.lucene.Index$Indexer.updateDatabase(Index.java:267)
at com.github.rnewson.couchdb.lucene.Index$Indexer.updateIndex(Index.java:194)
at com.github.rnewson.couchdb.lucene.Index$Indexer.access$100(Index.java:51)
at com.github.rnewson.couchdb.lucene.Index.main(Index.java:347)
at com.github.rnewson.couchdb.lucene.Main.main(Main.java:33)
2009-10-02 09:08:19,070 [couchdb-lucene] INFO Committed changes to index (1 documents in index, 0 deletes).

log config option not working?

I'm using rev 9fa591c (master) and I'm getting some errors that make me think the log config option in the .ini file isn't being respected:

https://gist.github.com/a6d8c7c36581bb20b72c

Any help would be very much appreciated.

Memory Issue

Robert,

I'm still seeing the memory issue. It doesn't seem to be related to couchdb.lucene.ram, which I have set to 512MB. It does, however, seem to be related to the -Xms/-Xmx heap size settings: more heap means it runs for longer. Right now I have set it up to build the lucene view as we copy/update documents over. That seems to be working for now.

[PATCH] 'info' requests do not work correctly on latest c-l 0.5

The 'info' request in the python external hook should look like:

path = '/'.join(['', 'info', host, str(port)] + path)

EDIT: Removed inline patch