healthonnet / hon-lucene-synonyms

Solr query parser plugin that performs proper query-time synonym expansion.

Home Page: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr

Java 100.00%

hon-lucene-synonyms's People

Contributors

atuljangra, avlukanin, janhoy, joekiller, mykezero, nolanlawson, oweneustice, rpialum, softwaredoug, yonas


hon-lucene-synonyms's Issues

Add a threshold for alternative queries

We've experienced a couple of site outages because the plugin was trying to create millions of alternative queries. Functionally, the plugin was trying to do the right thing.

However, a "bad" synonym and an edge-case query resulted in ~400M alternative queries, which brought down our cluster.

The solution we are proposing and deploying to our cluster is to add a threshold on how many alternative queries a single query can be expanded into. If the threshold is exceeded, we plan to halt the expansion and use the original query only. A warning will be logged, which we plan to use to send an alert to our merchandising and engineering teams.

If anyone has a better suggestion, we'd be happy to help create a patch.
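A minimal Java sketch of the kind of guard being proposed (the capExpansions method, the threshold value, and the logging call are all hypothetical, not part of the plugin's current API):

import java.util.Collections;
import java.util.List;

/**
 * Sketch only: cap how many alternative queries a single input query may expand into.
 * If the cap is exceeded, fall back to the original query and log a warning that
 * downstream monitoring can turn into an alert.
 */
public class ExpansionThresholdSketch {

    public static List<String> capExpansions(String originalQuery,
                                             List<String> expandedQueries,
                                             int maxExpansions) {
        if (expandedQueries.size() > maxExpansions) {
            // Halt the expansion and keep only the original query.
            System.err.printf(
                    "WARN: synonym expansion produced %d alternative queries (threshold %d); "
                            + "using the original query only: %s%n",
                    expandedQueries.size(), maxExpansions, originalQuery);
            return Collections.singletonList(originalQuery);
        }
        return expandedQueries;
    }
}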

Here is some test data to reproduce a similar issue. Bear with me on this example even though it doesn't make a lot of sense:

i o,i os,ios,io s,io=>i ox,i o,i os, iox,ios,io s,io,io x,ioxs

Query:

[smith - POLARIZED 5 $187.96 $234.95 Item # SMI0905 20% Off Sale -White/Rose Copper/Extra Blue Sensor Mirror, One Size ($187.96) Size? Quantity add to cart FREE SHIPPING on orders over $50* 100% Guaranteed Returns Price Match Guarantee buy with confidence TECH SPECS Frame Material: flexible urethane Helmet Compatible: yes Eyeglass Compatible: no Polarized Lens: yes Ventilation: Vaporator Lens Technology, Porex Filter Grip Strap: yes, silicone Face Size: small to medium Recommended Use: skiing, snowboarding Manufacturer Warranty: lifetime OTHER PEEPS PEEPED product title Smith IO Interchangeable Goggles with Bonus Lens From:$122.47 product title Smith IOX Interchangeable Goggle From:$131.21 product title Smith I/O Recon Goggle From:$519.96 product title Oakley Airbrake Goggle From:$121.00 product title Smith Lago Signature I/O Goggles From:$122.47 Smith I/OS Interchangeable Goggle - Polarized Current Color Available Colors/Styles Smith I/OS Interchangeable Goggle - Polarized White/Rose Copper/Extra Blue Sensor Mirror Detail Pics DESCRIPTION UNDENIABLY SLEEK LOOKS AND RIMLESS INTERCHANGEABILITY. Inspired by fashionable eyewear, designed to expose the truest picture of terrain in front of you, and sized for those with smaller faces, the polarized Smith I/OS Interchangeable Goggle is the luxury sports car of the goggle world. The minimal frame design creates a lightweight, comfortable feel while the easy-to-use (even with gloves on your hands) interchangeable clips allow you to adapt to conditions as quickly as they change. Flexible, minimally designed frame eliminates snow-clog on the lens edges and creates a sleek look that seamlessly fuses with your Smith helmet Carbonic-X lens with TLT Optics has a spherical shape that mimics the shape of your eye to eliminate distortion and open your peripheral view—premium scoutability for lines and possible hazards Quick-release interchangeable lens system works in conjunction with the minimal frame design—use the dual top clips to change the lens and keep up with varied conditions Vaporator Lens Technology—a dual-layer lens with a Porax Filter to keep the moisture out and prevent fogging DriWix dual-layer face foam features a soft backing layer at the frame and a supple fleece lining against your face to absorb sweat and wick it away Silicon backing keeps the adjustable, quick-clip strap stuck to your helmet or hat while the pivoting side clips allow the goggle to move with your face—no pressure points Includes two lenses and a microfiber goggle bag with a separate sleeve for a single lens What do you think of the Smith I/OS Interchangeable Goggle - Polarized? Share a... Write a review Ask a question Share a photo Share a video Hide detailed information YOUR COMMUNITY CONTRIBUTIONS Everything Reviews Photos Videos]

Having the original term in one field and the synonym in the other gets higher score than having the original term in both fields

I have a single synonym pair:
laptop,dizüstü
And I have two documents with two fields:
<doc>
<field name="id">01</field>
<field name="title">laptop</field>
<field name="content">laptop</field>
</doc>
<doc>
<field name="id">02</field>
<field name="title">laptop</field>
<field name="content">dizüstü</field>
</doc>

When q=laptop, the document with id=02 always has a higher score. Even if I set synonyms.originalBoost=10 or give different weights to the fields, it doesn't change. I thought that since document 01 has the original term in both fields, its score would be higher. It seems that having the original term in one field and the synonym in another field acts as a positive boost factor. Am I wrong?

Here is the parsedQuery value (tie=0.1):
+((title:laptop^1.5 | content:laptop)~0.1^1.2 ((+(title:dizüstü^1.5 | content:dizüstü)~0.1 () ()))) () ()
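For reference on how such a clause is scored: a clause like (title:laptop^1.5 | content:laptop)~0.1 is a DisjunctionMaxQuery with tie=0.1, which scores as max(per-field scores) + 0.1 × (sum of the other field scores), so a match in a second field adds to the score on top of the best-matching field rather than being ignored.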

Multi-word synonyms are unaffected by slop settings

Environment: Mac OS X, Solr 4.1.0

Steps to reproduce: Follow the "Getting Started" instructions, then enter the following test data:

URL='http://localhost:8983/solr/update'

curl $URL -H "Content-Type: text/xml" -d '<delete><query>*:*</query></delete>'

curl $URL/json -H 'Content-type:application/json' -d '
[ { 
    "id"   : "1",
    "name" : "dog"
  },
  { 
    "id"   : "2",
    "name" : "pooch"
  },
  { 
    "id"   : "3",
    "name" : "hound"
  },
  { 
    "id"   : "4",
    "name" : "canis familiaris"
  },
  { 
    "id"   : "5",
    "name" : "canis"
  },
  { 
    "id"   : "6",
    "name" : "familiaris"
  },
  { 
    "id"   : "7",
    "name" : "familiaris canis"
  } ]'
curl "$URL/?commit=true"

Then browse to the URL:

http://localhost:8983/solr/select/?q=dog&debugQuery=on&qf=text&defType=synonym_edismax&synonyms=true

Expected result: "pooch", "dog", "hound", and "canis familiaris" are matched.

Actual result: In addition to these 4, "familiaris canis" is also matched, because the query parser doesn't construct the multi-word synonym as a phrase query.
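For reference, a minimal sketch of the distinction, written against the Lucene 4.x query API for illustration only (it is not the plugin's actual code path): the multi-word synonym should become a phrase query, whereas independent term clauses also match the reversed document.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class PhraseVsTermsSketch {
    public static void main(String[] args) {
        // Expected: the synonym as a phrase, so "familiaris canis" (id 7) would not match.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("name", "canis"));
        phrase.add(new Term("name", "familiaris"));
        phrase.setSlop(0); // exact adjacency; a larger slop would relax order and distance

        // Roughly what the parser produces instead: two independent term clauses
        // that both must match, but in any order and at any distance.
        BooleanQuery terms = new BooleanQuery();
        terms.add(new TermQuery(new Term("name", "canis")), BooleanClause.Occur.SHOULD);
        terms.add(new TermQuery(new Term("name", "familiaris")), BooleanClause.Occur.SHOULD);
        terms.setMinimumNumberShouldMatch(2);

        System.out.println(phrase); // name:"canis familiaris"
        System.out.println(terms);  // (name:canis name:familiaris)~2
    }
}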

Add support for wildcards

I'm running Solr 4.1 with your plugin, and everything works just fine when no wildcards are present in the query. But if I do the following query: q=slimdown ... only one result is found instead of 10, for example. My synonym file looks like this:
weight loss, slimdown
However, Solr returns 10 results when the query has no wildcards.
What could be the problem?

Bug in #41 fix for Solr 5.3.1

Working on an upgrade to 5.3.1. Bug #41 was fixed, but with 5.3.1 another bug appears. The last test fails because the result is:

['"man\'s best friend"', 'canis familiaris', 'dog', 'hound', 'pooch']

instead of

['"canis familiaris"', '"man\'s best friend"', 'canis familiaris', 'dog', 'hound', 'pooch']

Here is the output:

$ nosetests test/
.................F
======================================================================
FAIL: test_queries (015-test-issue-41.TestBasic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "hon-lucene-synonyms/test/015-test-issue-41.py
    self.tst_query('canis familiaris', 4)
  File "hon-lucene-synonyms/test/015-test-issue-41.py
    self.assertEqual(cnt, quote_cnt)
AssertionError: 2 != 4
-------------------- >> begin captured stdout << ---------------------
['"canis familiaris"', '"dog"', '"hound"', '"man\'s best friend"', '"pooch"']
Quotes found count =  10
['"canis familiaris"', '"dog"', '"hound"', '"man\'s best friend"', '"pooch"']
Quotes found count =  10
['"canis familiaris"', '"dog"', '"hound"', '"man\'s best friend"', '"pooch"']
Quotes found count =  10
['"canis familiaris"', '"dog"', '"hound"', '"man\'s best friend"', '"pooch"']
Quotes found count =  10
['"canis familiaris"', '"man\'s best friend"', 'dog', 'hound', 'pooch']
Quotes found count =  4
['"canis familiaris"', '"man\'s best friend"', 'dog', 'hound', 'pooch']
Quotes found count =  4
['"canis familiaris"', '"man\'s best friend"', 'dog', 'hound', 'pooch']
Quotes found count =  4
['"man\'s best friend"', 'canis familiaris', 'dog', 'hound', 'pooch']
Quotes found count =  2

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 18 tests in 2.058s

FAILED (failures=1)

Not sure of the problem immediately, but I'm working on it...

expand=false behaviour when a phrase is given

I have this entry in my synonyms file:

new york => newyork

And I set expand to false in the config file. I expected this to replace "new york" with newyork in incoming queries, but I found that both ("new york" and newyork) are being used. Is that the expected behavior?

Thanks,
-Kee

Avoid luceneMatchVersion in config

It should be possible to resolve luceneMatchVersion from Solr itself and use that for all components being initialized, getting rid of it in the config. With #19 you need to specify the version for every analyzer component, which is a bit verbose.

Incompatibility with Solr 3.6.0, 3.6.1, and 4.0

This may turn into a super-bug, but let's hope not.

First off, there is a problem accessing the queryFields private variable, which was fixed by this commit. Other problems abound, though, as detailed in these WordPress comments.

First off, there's this guy:

SEVERE: java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser

Then there's this guy:

java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at ...

I can confirm seeing these same issues myself when I try to build using Solr 3.6.0 instead of 3.5.0.

Currently I'm still dumbfounded by these bugs. SynonymExpandingExtendedDismaxQParser is clearly an ExtendedDismaxQParser. It's even in the same package. So I have no idea why the class loader is complaining about this.

When this query parser finds synonyms, it needs the longest match.

I insert the synonyms for dog as shown below.
When I search "dog", I want to search for "dog" or "man's best friend" or "dog(inc)", and it works perfectly.
When I search "dog(inc)", I want to search for "dog(inc)" or "dog" or "man's best friend" too.
But this query parser also finds synonyms for "dog" inside "dog(inc)" (maybe it uses the shortest match), and it searches for ("dog" and "inc") or ("dog's synonyms" and "inc").

hmm....
I think the synonym_edismax query parser should use the longest match for the search query.

# tokenizer
 query : StandardTokenizer
 synonym : StandardTokenizer
 dog(inc) -> dog inc
# synonyms.txt
 dog, man's best friend, dog(inc)
# search phrase
 search query : dog ==> OK
http://127.0.0.1:8983/solr/select?qf=Title_t&q=dog&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9
 ==> +((Title_t:dog)^1.1 (((+(Title_t:dog)) (+(Title_t:dog(inc))) (+(Title_t:"man's best friend")))^0.9))

 search query : dog(inc) ==> finds dog's synonyms and makes an AND search phrase out of each of dog's synonyms and "inc".
http://127.0.0.1:8983/solr/select?qf=Title_t&q=dog(inc)&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9
 ==> +((((Title_t:dog) (Title_t:inc))~2^1.1) (((+(((Title_t:dog) (Title_t:inc))~2)) (+(((Title_t:"dog inc") (Title_t:inc))~2)) (+(((Title_t:"man's best friend") (Title_t:inc))~2)))^0.9))

synonyms.synonymBoost is doubled

When I use synonyms.synonymBoost=0.7, the weight is applied twice to the synonyms. Check: http://localhost:8983/solr/collection1/select?q=dog&wt=xml&indent=true&debugQuery=true&qf=text&defType=synonym_edismax&synonyms=true&synonyms.synonymBoost=0.7

  <str name="parsedquery">(+(DisjunctionMaxQuery((text:dog)) ((((+(DisjunctionMaxQuery((text:canis)) DisjunctionMaxQuery((text:familiaris))))/no_coord))^0.7) ((((+DisjunctionMaxQuery((text:hound^0.7)))/no_coord))^0.7) ((((+(DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best)) DisjunctionMaxQuery((text:friend))))/no_coord))^0.7) ((((+DisjunctionMaxQuery((text:pooch^0.7)))/no_coord))^0.7)))/no_coord</str>
  <str name="parsedquery_toString">+((text:dog) (((+((text:canis) (text:familiaris))))^0.7) (((+(text:hound^0.7)))^0.7) (((+((text:man's) (text:best) (text:friend))))^0.7) (((+(text:pooch^0.7)))^0.7))</str>
  <lst name="explain"/>
  <arr name="queryToHighlight">
    <str>org.apache.lucene.search.BooleanClause:(text:dog)</str>
    <str>org.apache.lucene.search.BooleanClause:((+((text:canis) (text:familiaris))))^0.7</str>
    <str>org.apache.lucene.search.BooleanClause:((+(text:hound^0.7)))^0.7</str>
    <str>org.apache.lucene.search.BooleanClause:((+((text:man's) (text:best) (text:friend))))^0.7</str>
    <str>org.apache.lucene.search.BooleanClause:((+(text:pooch^0.7)))^0.7</str>
  </arr>
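Note that in the parsed query above, the 0.7 boost appears both on the inner single-term synonyms (e.g. text:pooch^0.7) and on their enclosing clauses ((...))^0.7, so those synonyms effectively receive 0.7 × 0.7 = 0.49 rather than 0.7.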

But if I add pf, it works OK: http://localhost:8983/solr/collection1/select?q=dog&wt=xml&indent=true&debugQuery=true&qf=text&defType=synonym_edismax&synonyms=true&synonyms.synonymBoost=0.7&pf=text^1.2

Synonyms are scored more harshly when there are > 2

Consider the synonyms:

dog,hound,pooch

And a document containing:

dog

We would expect a query for dog, pooch, and hound to arrive at the same score, assuming synonyms.originalBoost and synonyms.synonymBoost are both set to 1.0.

However, hound and pooch are scored lower than dog. Apparently this has to do with the DisjunctionMaxQuery math and the fact that there are more than two synonyms. For example, when the synonyms are just:

dog,hound

the scores are equal.

mm parameter isn't working properly with spellcheck component

In the configuration of the /select request handler, if I use the spellcheck default parameters and add the mm=75% parameter, Solr gets stuck when running a query. Here is the full config of the request handler:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="defType">synonym_edismax</str>
        <int name="timeAllowed">1000</int>
        <str name="qf">title^3 title_s^2 content</str>       
        <str name="pf">title content</str>       
        <str name="fl">id,title,content,score</str>       
        <float name="tie">0.1</float>
        <str name="lowercaseOperators">true</str>
        <str name="stopwords">true</str>
        <int name="rows">10</int>

        <str name="synonyms">true</str>
        <str name="synonyms.originalBoost">1.5</str>
        <str name="synonyms.synonymBoost">1.1</str>
        <str name="synonyms.disablePhraseQueries">true</str>       

        <str name="mm">75%</str> <!-- If I comment out this line or the following block, it works -->

        <str name="spellcheck">on</str>
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.dictionary">wordbreak</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.maxResultsForSuggest">5</str>
        <str name="spellcheck.extendedResults">false</str>
        <str name="spellcheck.alternativeTermCount">2</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.collateExtendedResults">false</str>
        <str name="spellcheck.maxCollationTries">5</str>
        <str name="spellcheck.maxCollations">3</str>
    </lst>

    <arr name="last-components">
    <str>spellcheck</str>
    </arr>   
</requestHandler>

Any idea? Thanks

synonyms.disablePhraseQueries is ignored if using WhitespaceTokenizer

Unfortunately, the fixes I made for #9 and #26 (to use the WhitespaceTokenizer instead of the StandardTokenizer as the default config) also cause phrases to never be expanded (since the quotes are considered part of the word).

I'm thinking the best solution to this will be a somewhat hacky workaround using Token Filters to remove the quotes, since the WhitespaceTokenizer seemed to work so well in just about every other case.

Too much memory consumed on queries matching many synonym groups

Hello! I've noticed a problem with the plugin similar to what was happening in issue #38.

If a query is broad - with matches in many synonym groups - then the plugin starts expanding too many synonyms, which causes Java heap memory to be consumed too rapidly, eventually resulting in an out-of-memory exception. Even if the query succeeds, the request takes a long time to process (request time > 5 seconds).

The linked issue suggests using the synonyms.bag=true flag, which seems to keep memory usage down, but are there any downsides to using that flag?

I've tested this under Solr 6.0.0 with the hon-lucene-synonyms-5.0.5.jar file. Here are the query and synonyms that seem to trigger this problem.

Query

http://localhost:8983/solr/test/select?q="bobcat pup tortoise bunny angle toad"&defType=synonym_edismax&synonyms=true&debugQuery=true

Synonyms.txt

# Cats
bobcat, cheetah, cougar, jaguar, kitten, kitty, leopard, lion, lynx, mouser, ocelot, panther, puma, puss, pussy, tabby, tiger, tom, tomcat, grimalkin, malkin

# Dogs
pup, puppy, bitch, cur, doggy, hound, mongrel, mutt, pooch, stray, tyke, bowwow, fido, flea bag, man's best friend, tail-wagger

# Turtles
tortoise, chelonian, cooter, leatherback, loggerhead, slowpoke, snapper, terrapin, testudinal

# Rabbits
bunny, hare, rodent, buck, capon, chinchilla, coney, cony, cottontail, cuniculus, doe, lagomorph, lapin

# Fish
angle, bait, bob, cast, chum, extract, extricate, find, net, produce, seine, trawl, troll, bait the hook, cast one's hook, cast one's net, go fishing, haul out, pull out

# Frogs
toad, bullfrog, caecilian, croaker, polliwog
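For a rough sense of scale (treating each query term independently and ignoring the multi-word entries): the six query terms match groups of about 21, 16, 9, 13, 19, and 5 entries respectively, so a naive cross-product expansion is on the order of 21 × 16 × 9 × 13 × 19 × 5 ≈ 3.7 million alternative queries, which is consistent with the heap pressure described above.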

BooleanQuery$TooManyClauses

Hi,

we love your synonym component. Thank you for that.
Unfortunately, we have some issues with the created synonyms. It's very common that the number of generated boolean clauses gets very high, often even more than 1024.

We have synonyms for numbers 1 to 10:

1, i, l, eins
2, ii, ll, zwei
3, iii, lll, drei
4, iv, lv, vier
5, v, fünf, fuenf
6, vi, vl, sechs
7, vii, vll, sieben
8, viii, vlll, acht
9, ix, lx, neun
10, x, zehn

We have set expand=true

A search for "harry potter 1 2 3 4 5 6" will create many boolean queries, resulting either in the exception "BooleanQuery$TooManyClauses" or, in extreme cases, in a forced full garbage collection, because so many objects are created that the young and old generations fill up within seconds.
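For a rough sense of scale: each of the six digits in that query belongs to a four-entry synonym group, so a full cross-product expansion already yields 4^6 = 4096 combinations, past Lucene's default limit of 1024 boolean clauses.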

I assume our synonym files are just plain wrong. :-( Do you have any suggestions?
Is there an option to limit the number of created synonyms?

Thank you!

Limit synonyms?

After importing more than roughly 8500 synonyms, I got this error:
You defined >1 synonym analyzer in your configuration, but you left synonyms.analyzer empty.

After I deleted around 7000 synonyms (8403 synonyms left), the error disappeared.

I use => on all of the entries in the database, because the synonyms are not supposed to work in the reverse direction.

Port Tests to Java

In the interest of maintainability, I propose that all tests be migrated to Java.

Results not matching tutorial with solr-4.3.0

When following the tutorial step by step using solr-4.3.0, the response obtained at step 8 is different from what is expected. Note the /no_coord clause instead of the ~2 or ~3 in the parsed query, and also the presence of ExtendedDismaxQParser instead of SynonymExpandingExtendedDismaxQParser in the QParser slot:

<response>
    ...
    <result name="response" numFound="0" start="0"/>
    <lst name="debug">
        <str name="rawquerystring">dog</str>
        <str name="querystring">dog</str>
        <str name="parsedquery">
(+(DisjunctionMaxQuery((text:dog)) (((+(DisjunctionMaxQuery((text:canis)) DisjunctionMaxQuery((text:familiaris))))/no_coord)) (((+DisjunctionMaxQuery((text:hound)))/no_coord)) (((+(DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best)) DisjunctionMaxQuery((text:friend))))/no_coord)) (((+DisjunctionMaxQuery((text:pooch)))/no_coord))))/no_coord
</str>
        <str name="parsedquery_toString">
+((text:dog) ((+((text:canis) (text:familiaris)))) ((+(text:hound))) ((+((text:man's) (text:best) (text:friend)))) ((+(text:pooch))))
</str>
        <lst name="explain"/>
        <arr name="queryToHighlight">
            <str>org.apache.lucene.search.BooleanClause:(text:dog)</str>
            <str>
org.apache.lucene.search.BooleanClause:(+((text:canis) (text:familiaris)))
</str>
            <str>
org.apache.lucene.search.BooleanClause:(+(text:hound))
</str>
            <str>
org.apache.lucene.search.BooleanClause:(+((text:man's) (text:best) (text:friend)))
</str>
            <str>
org.apache.lucene.search.BooleanClause:(+(text:pooch))
</str>
        </arr>
        <arr name="expandedSynonyms">
            <str>canis familiaris</str>
            <str>dog</str>
            <str>hound</str>
            <str>man's best friend</str>
            <str>pooch</str>
        </arr>
        <lst name="mainQueryParser">
            <str name="QParser">ExtendedDismaxQParser</str>
            <null name="altquerystring"/>
            <null name="boost_queries"/>
            <arr name="parsed_boost_queries"/>
            <null name="boostfuncs"/>
        </lst>
        <lst name="synonymQueryParser">
            <str name="QParser">ExtendedDismaxQParser</str>
            <null name="altquerystring"/>
            <null name="boost_queries"/>
            <arr name="parsed_boost_queries"/>
            <null name="boostfuncs"/>
        </lst>
        ...
</response>

instead of

<response>
  ...
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">dog</str>
    <str name="querystring">dog</str>
    <str name="parsedquery">
        +(DisjunctionMaxQuery((text:dog)) (((DisjunctionMaxQuery((text:canis)) 
        DisjunctionMaxQuery((text:familiaris)))~2) DisjunctionMaxQuery((text:hound)) 
        ((DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best)) 
        DisjunctionMaxQuery((text:friend)))~3) DisjunctionMaxQuery((text:pooch))))
    </str>
    <str name="parsedquery_toString">
        +((text:dog) ((((text:canis) (text:familiaris))~2) (text:hound) 
        (((text:man's) (text:best) (text:friend))~3) (text:pooch)))
    </str>
    <lst name="explain"/>
    <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
    ...
  </lst>
</response>

Is there a missing step and should the tutorial be modified to reflect this?

synonym and bq duplicate bug for solr 3.x and 4.x

Hello

I have noticed that when using the edismax bq parameter, the bq gets applied twice: once to the synonym expansion and once as an addition to the overall score.

I have a fresh download of solr 4.5.

I have a synonym file containing only one line:

usa, united states of america

I have the following 2 documents that I index into solr using the post.jar:

[
   {
      "id":"1",
      "name":"Enterprises in the USA.",
      "cat":"Y"
   },
   {
      "id":"2",
      "name":"Enterprises in the United States of America.",
      "cat":"Y"
   }
]

when I search for "USA" using the following query:

http://localhost:8983/solr/collection1/select?q="usa"&pf=name&qf=name&bq=cat:Y^1000&debugQuery=true&defType=synonym_edismax&synonyms=true

I get back in the debug query

<str name="parsedquery">(+(DisjunctionMaxQuery((name:usa)) (((+DisjunctionMaxQuery((name:"united states of america")) () cat:Y^1000.0)/no_coord))) () cat:Y^1000.0)/no_coord</str>

<str name="parsedquery_toString">+((name:usa) ((+(name:"united states of america") () cat:Y^1000.0))) () cat:Y^1000.0</str>

meaning that doc 2 comes first as the expansion "united states of america" gets one extra bq boost.

Note the duplicate cat:Y^1000.0 in the parsedQuery; the first of these may be the actual bug?

Same when I search for "United States of America":

http://localhost:8983/solr/collection1/select?q="united states of america"&pf=name&qf=name&bq=cat:Y^1000&debugQuery=true&defType=synonym_edismax&synonyms=true

I get

<str name="parsedquery">(+(DisjunctionMaxQuery((name:"united states of america")) (((+DisjunctionMaxQuery((name:usa)) () cat:Y^1000.0)/no_coord))) () cat:Y^1000.0)/no_coord</str>

<str name="parsedquery_toString">+((name:"united states of america") ((+(name:usa) () cat:Y^1000.0))) () cat:Y^1000.0</str>

meaning that doc 1, which contains the expansion "USA", gets one additional bq boost.

In general, the bug is that the synonym expansion gets a duplicate bq boost.

Note that neither synonyms.originalBoost nor synonyms.synonymBoost is being used.

I have seen this issue in Solr 3.6 as well as 4.5.

Incompatibility with Solr 4.0.0 and 4.1.0

Welp, I thought my compatibility problems were done, but apparently I should have done my homework better, because Solr 4.0.0 and 4.1.0 are broken.

Following the directions in the "Getting Started", for both versions I'm greeted with:

SEVERE: Unable to create core: collection1
org.apache.solr.common.SolrException
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:794)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:607)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1003)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1033)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.AbstractMethodError
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:614)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:788)
    ... 13 more

Jan 30, 2013 2:53:12 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.solr.common.SolrException
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:794)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:607)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1003)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1033)
    ... 10 more
Caused by: java.lang.AbstractMethodError
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:614)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:788)
    ... 13 more

Prevent synonyms from being stemmed

Normally this is done by putting solr.SynonymFilterFactory after solr.SnowballPorterFilterFactory in the chain.

How can I stem normal queries and not stem synonyms?

NullPointerException in StandardTokenizerImpl in Solr 4.1.0

This bug only seems to occur in Solr 4.1.0. Affects version 1.2.

Steps to reproduce: follow the "Getting Started" directions for Solr 4.1.0, and punch in the URL as directed. The response is:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<response>
<lst name="responseHeader">
<int name="status">500</int>
<int name="QTime">6669</int>
<lst name="params">
<str name="debugQuery">on</str>
<str name="q">dog</str>
<str name="qf">text</str>
<str name="synonyms">true</str>
<str name="defType">synonym_edismax</str>
</lst>
</lst>
<lst name="error">
<str name="trace">
...
</str>
<int name="code">500</int>
</lst>
</response>

And the stacktrace is:

SEVERE: null:java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
    at org.apache.lucene.analysis.shingle.ShingleFilter.getNextToken(ShingleFilter.java:367)
    at org.apache.lucene.analysis.shingle.ShingleFilter.shiftInputWindow(ShingleFilter.java:421)
    at org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:286)
    at org.apache.lucene.analysis.synonym.SynonymFilter.parse(SynonymFilter.java:358)
    at org.apache.lucene.analysis.synonym.SynonymFilter.incrementToken(SynonymFilter.java:624)
    at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.generateSynonymQueries(SynonymExpandingExtendedDismaxQParserPlugin.java:359)
    at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.attemptToApplySynonymsToQuery(SynonymExpandingExtendedDismaxQParserPlugin.java:273)
    at org.apache.solr.search.SynonymExpandingExtendedDismaxQParser.parse(SynonymExpandingExtendedDismaxQParserPlugin.java:262)
    at org.apache.solr.search.QParser.getQuery(QParser.java:142)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:117)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:680)

Simplify configuration

Currently the XML configuration contains a lot of boilerplate (e.g. this example). A suggestion was made to change it to something simpler, such as:

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <str name="defaultDict">english</str>
  <lst name="dictionaries">
    <lst name="english">
      <str name="fieldType">synonym_type_en</str>
      <str name="useForFields">title *_en</str>
    </lst>
    <lst name="addresses">
      <str name="fieldType">synonym_type_addr</str>
      <str name="useForFields">street city state</str>
    </lst>
  </lst>
</queryParser>

We probably still need some way to specify some of the parameters (such as the shingle sizes), but I can choose sensible defaults so that most of the parameters wouldn't be necessary.

This query parser uses the original query itself twice.

This query parser uses the raw query itself twice.
For example, I search for the query "ny", which has synonyms. When I check the query phrase, I find "ny" in it twice.
The query "new york" behaves the same way. When I check the scores of matched documents, those documents get an advantage.
I think the raw query doesn't need to be searched again inside the synonym search phrase. Please consider it.

$ tail synonyms.txt
Television, Televisions, TV, TVs

#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too

pixima => pixma
new york, nyc,ny, new york city
dog => hound, pooch, canis familiaris, man's best friend
fc => football club
ml => major league

http://127.0.0.1:8983/solr/select?qf=Title_t&q=ny%20beaches&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true

+((((Title_t:ny) (Title_t:beaches))~2) ((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2))))

http://127.0.0.1:8983/solr/select?qf=Title_t&q=new%20york%20beaches&defType=synonym_edismax&synonyms=true&debugQuery=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&q.op=AND&synonyms.constructPhrases=true

+((((Title_t:new) (Title_t:york) (Title_t:beaches))~3^1.1) (((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2)))^0.9))

7.8 = (MATCH) sum of:

3.3000002 = (MATCH) sum of:
1.1 = (MATCH) weight(Title_t:new in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight, product of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
1.1 = (MATCH) weight(Title_t:york in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight, product of: 1.0 = idf(docFreq=46, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044, product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=46, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
1.1 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity],
result of: 1.1 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 1.1 = queryWeight,
product of: 1.0 = idf(docFreq=5, maxDocs=144370) 1.1 = queryNorm 1.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=5, maxDocs=144370) 1.0 = fieldNorm(doc=10044)

4.5 = (MATCH) sum of:
4.5 = (MATCH) sum of:
3.6 = (MATCH) weight(Title_t:"new york" in 10044) [MinimalScoreDefaultSimilarity],
result of: 3.6 = score(doc=10044,freq=1.0 = phraseFreq=1.0),
product of: 1.8 = queryWeight,
product of: 2.0 = idf(),
sum of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = idf(docFreq=46, maxDocs=144370) 0.9 = queryNorm 2.0 = fieldWeight in 10044,
product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = phraseFreq=1.0 2.0 = idf(), sum of: 1.0 = idf(docFreq=83, maxDocs=144370) 1.0 = idf(docFreq=46, maxDocs=144370) 1.0 = fieldNorm(doc=10044)
0.9 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity],
result of: 0.9 = score(doc=10044,freq=1.0 = termFreq=1.0),
product of: 0.9 = queryWeight,
product of: 1.0 = idf(docFreq=5, maxDocs=144370) 0.9 = queryNorm 1.0 = fieldWeight in 10044, product of: 1.0 = tf(freq=1.0),
with freq of: 1.0 = termFreq=1.0 1.0 = idf(docFreq=5, maxDocs=144370) 1.0 = fieldNorm(doc=10044)

IllegalAccessError in Solr 3.6.1

My writeup of the problem:

It looks like in Solr 3.5.0, ExtendedDismaxQParserPlugin defines "queryFields" to be package-private, whereas in Solr 3.6.1, it becomes private. So my code fails because I try to access the superclass's "queryFields" (here).

One user's report on the problem:

My environment:
JAVA: java version "1.6.0_18" (OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1))
TOMCAT: 7.0.23
SOLR: 3.6.1
I then added the tag metadata in the /opt/tomcat-master-dev/conf/web.xml file:
The full Java stack trace:
11 déc. 2012 08:54:31 org.apache.solr.common.SolrException log
GRAVE: java.lang.IllegalAccessError: class org.apache.solr.search.SynonymExpandingExtendedDismaxQParser cannot access its superclass org.apache.solr.search.ExtendedDismaxQParser
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:615)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:334)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:419)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:441)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1612)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1606)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1639)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1556)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:555)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:480)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:332)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4624)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5281)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:866)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:842)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:615)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:649)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1581)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
The problem is still there…

ConstructPhrases not working in 1.3.4 with solr 4.2

Request

/select?debugQuery=on&mm=100%25&qs=0&hl=true&fl=*+score&defType=synonym_edismax&synonyms=true&synonyms.constructPhrases=true&f.content_texts.hl.snippets=3&start=0&hl.fragsize=0&q.op=AND&fq=type:(Doj+OR+Jurisprudence+OR+Law+OR+Taxation)&fq=abouts_im:(3)&fq=edited_b:true&hl.maxAlternateFieldLength=-1&rows=20&synonyms.analyzer=defaultAnalyzer&hl.simple.pre=@@@hl@@@&q=ra+8424&hl.mergeContiguous=true&hl.simple.post=@@@endhl@@@&qf=reference_number_texts^10.0&hl.maxAnalyzedChars=-1&f.content_texts.hl.fragsize=100&hl.fl=reference_number_texts&hl.fl=title_texts&hl.fl=ponente_texts&hl.fl=short_title_texts&hl.fl=content_texts&f.content_texts.hl.mergeContiguous=true&wt=xml&indent=true

Schema

<fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.LowerCaseFilterFactory"/>-->
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <!--<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="word-delim-types.txt" />-->
        <filter class="solr.LengthFilterFactory" min="1" max="100"/>
      </analyzer>
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <!--<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="word-delim-types.txt" />-->
        <filter class="solr.LengthFilterFactory" min="1" max="100"/>
      </analyzer>
    </fieldType>

Config

<!-- Main configuration for a synonym-expanding ExtendedDismaxQParserPlugin.  See
     https://github.com/healthonnet/hon-lucene-synonyms
     for more info on this plugin.
  -->
  <queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
    <!-- You can define more than one synonym analyzer in the following list.
         For example, you might have one set of synonyms for English, one for French,
         one for Spanish, etc.
      -->
    <lst name="synonymAnalyzers">
      <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
           If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
        -->
      <lst name="defaultAnalyzer">
        <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
             This seems to work best with most people's synonym files.
             For details, read the discussion here: http://github.com/healthonnet/hon-lucene-synonyms/issues/26
          -->
        <lst name="tokenizer">
          <str name="class">solr.LowerCaseFilterFactory</str>
        </lst>
        <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths (e.g. unigrams, bigrams, trigrams, etc.).
             The default here is to assume you don't have any synonyms longer than 4 tokens.
             You can tweak this depending on what your synonyms look like. E.g. if you only have unigrams, you can remove
             it entirely, and if your synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
          -->
        <!--<lst name="filter">-->
          <!--<str name="class">solr.ShingleFilterFactory</str>-->
          <!--<str name="outputUnigramsIfNoShingles">true</str>-->
          <!--<str name="outputUnigrams">true</str>-->
          <!--<str name="minShingleSize">2</str>-->
          <!--<str name="maxShingleSize">4</str>-->
        <!--</lst>-->
        <!-- This is where you set your synonym file.  For the unit tests and "Getting Started" examples, we use example_synonym_file.txt.
             This plugin will work best if you keep expand set to true and have all your synonyms comma-separated (rather than =>-separated).
          -->
        <lst name="filter">
          <str name="class">solr.SynonymFilterFactory</str>
          <str name="tokenizerFactory">solr.LowerCaseFilterFactory</str>
          <str name="synonyms">synonyms.txt</str>
          <str name="expand">true</str>
          <str name="ignoreCase">true</str>
        </lst>
      </lst>
    </lst>
  </queryParser>

Synonyms

RA,R.A.,Rep. Act.,Republic Act

Debug

<lst name="debug">
<str name="rawquerystring">ra 8424</str>
<str name="querystring">ra 8424</str>
<str name="parsedquery">
(+(((DisjunctionMaxQuery((reference_number_texts:ra^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2) (((+((DisjunctionMaxQuery((reference_number_texts:a^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2))/no_coord)) (((+((DisjunctionMaxQuery((reference_number_texts:act^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2))/no_coord)) (((+((DisjunctionMaxQuery((reference_number_texts:r^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2))/no_coord)) (((+((DisjunctionMaxQuery((reference_number_texts:rep^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2))/no_coord)) (((+((DisjunctionMaxQuery((reference_number_texts:republic^10.0)) DisjunctionMaxQuery((reference_number_texts:8424^10.0)))~2))/no_coord))))/no_coord
</str>
<str name="parsedquery_toString">
+((((reference_number_texts:ra^10.0) (reference_number_texts:8424^10.0))~2) ((+(((reference_number_texts:a^10.0) (reference_number_texts:8424^10.0))~2))) ((+(((reference_number_texts:act^10.0) (reference_number_texts:8424^10.0))~2))) ((+(((reference_number_texts:r^10.0) (reference_number_texts:8424^10.0))~2))) ((+(((reference_number_texts:rep^10.0) (reference_number_texts:8424^10.0))~2))) ((+(((reference_number_texts:republic^10.0) (reference_number_texts:8424^10.0))~2))))
</str>
<lst name="explain">
<str name="Law 10608">
1.2898889 = (MATCH) product of: 3.8696666 = (MATCH) sum of: 1.9321347 = (MATCH) sum of: 0.18564129 = (MATCH) weight(reference_number_texts:act^10.0 in 2360) [DefaultSimilarity], result of: 0.18564129 = score(doc=2360,freq=1.0 = termFreq=1.0 ), product of: 0.10401606 = queryWeight, product of: 10.0 = boost 3.5694737 = idf(docFreq=11212, maxDocs=146430) 0.0029140448 = queryNorm 1.7847369 = fieldWeight in 2360, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 3.5694737 = idf(docFreq=11212, maxDocs=146430) 0.5 = fieldNorm(doc=2360) 1.7464935 = (MATCH) weight(reference_number_texts:8424^10.0 in 2360) [DefaultSimilarity], result of: 1.7464935 = score(doc=2360,freq=1.0 = termFreq=1.0 ), product of: 0.31904107 = queryWeight, product of: 10.0 = boost 10.948393 = idf(docFreq=6, maxDocs=146430) 0.0029140448 = queryNorm 5.4741964 = fieldWeight in 2360, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 10.948393 = idf(docFreq=6, maxDocs=146430) 0.5 = fieldNorm(doc=2360) 1.9375317 = (MATCH) sum of: 0.19103824 = (MATCH) weight(reference_number_texts:republic^10.0 in 2360) [DefaultSimilarity], result of: 0.19103824 = score(doc=2360,freq=1.0 = termFreq=1.0 ), product of: 0.1055172 = queryWeight, product of: 10.0 = boost 3.6209877 = idf(docFreq=10649, maxDocs=146430) 0.0029140448 = queryNorm 1.8104938 = fieldWeight in 2360, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 3.6209877 = idf(docFreq=10649, maxDocs=146430) 0.5 = fieldNorm(doc=2360) 1.7464935 = (MATCH) weight(reference_number_texts:8424^10.0 in 2360) [DefaultSimilarity], result of: 1.7464935 = score(doc=2360,freq=1.0 = termFreq=1.0 ), product of: 0.31904107 = queryWeight, product of: 10.0 = boost 10.948393 = idf(docFreq=6, maxDocs=146430) 0.0029140448 = queryNorm 5.4741964 = fieldWeight in 2360, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 10.948393 = idf(docFreq=6, maxDocs=146430) 0.5 = fieldNorm(doc=2360) 0.33333334 = coord(2/6)
</str>
<str name="Law 27326">
0.39704 = (MATCH) product of: 2.38224 = (MATCH) sum of: 2.38224 = (MATCH) sum of: 0.85405827 = (MATCH) weight(reference_number_texts:ra^10.0 in 17203) [DefaultSimilarity], result of: 0.85405827 = score(doc=17203,freq=1.0 = termFreq=1.0 ), product of: 0.23850794 = queryWeight, product of: 10.0 = boost 8.1847725 = idf(docFreq=110, maxDocs=146430) 0.0029140448 = queryNorm 3.580838 = fieldWeight in 17203, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 8.1847725 = idf(docFreq=110, maxDocs=146430) 0.4375 = fieldNorm(doc=17203) 1.5281818 = (MATCH) weight(reference_number_texts:8424^10.0 in 17203) [DefaultSimilarity], result of: 1.5281818 = score(doc=17203,freq=1.0 = termFreq=1.0 ), product of: 0.31904107 = queryWeight, product of: 10.0 = boost 10.948393 = idf(docFreq=6, maxDocs=146430) 0.0029140448 = queryNorm 4.7899218 = fieldWeight in 17203, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 10.948393 = idf(docFreq=6, maxDocs=146430) 0.4375 = fieldNorm(doc=17203) 0.16666667 = coord(1/6)
</str>
<str name="Law 27333">
0.39704 = (MATCH) product of: 2.38224 = (MATCH) sum of: 2.38224 = (MATCH) sum of: 0.85405827 = (MATCH) weight(reference_number_texts:ra^10.0 in 17210) [DefaultSimilarity], result of: 0.85405827 = score(doc=17210,freq=1.0 = termFreq=1.0 ), product of: 0.23850794 = queryWeight, product of: 10.0 = boost 8.1847725 = idf(docFreq=110, maxDocs=146430) 0.0029140448 = queryNorm 3.580838 = fieldWeight in 17210, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 8.1847725 = idf(docFreq=110, maxDocs=146430) 0.4375 = fieldNorm(doc=17210) 1.5281818 = (MATCH) weight(reference_number_texts:8424^10.0 in 17210) [DefaultSimilarity], result of: 1.5281818 = score(doc=17210,freq=1.0 = termFreq=1.0 ), product of: 0.31904107 = queryWeight, product of: 10.0 = boost 10.948393 = idf(docFreq=6, maxDocs=146430) 0.0029140448 = queryNorm 4.7899218 = fieldWeight in 17210, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 10.948393 = idf(docFreq=6, maxDocs=146430) 0.4375 = fieldNorm(doc=17210) 0.16666667 = coord(1/6)
</str>
</lst>
<arr name="queryToHighlight">
<str>
org.apache.lucene.search.BooleanClause:((reference_number_texts:ra^10.0) (reference_number_texts:8424^10.0))~2
</str>
<str>
org.apache.lucene.search.BooleanClause:(+(((reference_number_texts:a^10.0) (reference_number_texts:8424^10.0))~2))
</str>
<str>
org.apache.lucene.search.BooleanClause:(+(((reference_number_texts:act^10.0) (reference_number_texts:8424^10.0))~2))
</str>
<str>
org.apache.lucene.search.BooleanClause:(+(((reference_number_texts:r^10.0) (reference_number_texts:8424^10.0))~2))
</str>
<str>
org.apache.lucene.search.BooleanClause:(+(((reference_number_texts:rep^10.0) (reference_number_texts:8424^10.0))~2))
</str>
<str>
org.apache.lucene.search.BooleanClause:(+(((reference_number_texts:republic^10.0) (reference_number_texts:8424^10.0))~2))
</str>
</arr>
<arr name="expandedSynonyms">
<str>a 8424</str>
<str>act 8424</str>
<str>r 8424</str>
<str>ra 8424</str>
<str>rep 8424</str>
<str>republic 8424</str>
</arr>
<lst name="mainQueryParser">
<str name="QParser">ExtendedDismaxQParser</str>
<null name="altquerystring"/>
<null name="boost_queries"/>
<arr name="parsed_boost_queries"/>
<null name="boostfuncs"/>
</lst>
<lst name="synonymQueryParser">
<str name="QParser">ExtendedDismaxQParser</str>
<null name="altquerystring"/>
<null name="boost_queries"/>
<arr name="parsed_boost_queries"/>
<null name="boostfuncs"/>
</lst>
<arr name="filter_queries">
<str>type:(Doj OR Jurisprudence OR Law OR Taxation)</str>
<str>abouts_im:(3)</str>
<str>edited_b:true</str>
</arr>
<arr name="parsed_filter_queries">
<str>type:Doj type:Jurisprudence type:Law type:Taxation</str>
<str>abouts_im:3</str>
<str>edited_b:true</str>
</arr>
<lst name="timing">
<double name="time">570.0</double>
<lst name="prepare">
<double name="time">8.0</double>
<lst name="query">
<double name="time">8.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="spellcheck">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">0.0</double>
</lst>
</lst>
<lst name="process">
<double name="time">562.0</double>
<lst name="query">
<double name="time">0.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="spellcheck">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">268.0</double>
</lst>
</lst>
</lst>
</lst>

Why is synonyms.constructPhrases not working? Is my configuration wrong?

Thank you

Construct Phrases may result in double quotes when original search is quoted

There are a few issues with the current constructPhrases logic when expanding synonyms. One of them can result in multiple quotes being applied when the original search is quoted.

i.e.: Search: "Internal Revenue Service" takes money
Synonyms: (IRS, tax service, internal revenue service)

Current results: "IRS" takes money; ""tax service"" takes money; ""internal revenue service"" takes money.

My proposed solution involves making a few fixes within generateSynonymQueries(), all applying when SynonymDismaxParams.SYNONYMS_CONSTRUCT_PHRASES is set to true.

  1. Only apply quotes when the synonym term is a phrase (more than one term).
  2. Only apply quotes when the synonym phrase is not already surrounded by quotes.

Changes:

Add to the top of generateSynonymQueries():

String origQuery = getQueryStringFromParser();
int queryLen = origQuery.length();

// TODO: make the token stream reusable?
TokenStream tokenStream = synonymAnalyzer.tokenStream(SynonymDismaxConst.IMPOSSIBLE_FIELD_NAME,
        new StringReader(origQuery));

Replace the current phrase-query if-logic with:

if (constructPhraseQueries && typeAttribute.type().equals("SYNONYM")
        && termToAdd.contains(" ")) {
    // only quote when the original span is not already surrounded by quotes
    if (offsetAttribute.startOffset() == 0
            || offsetAttribute.endOffset() == queryLen
            || origQuery.charAt(offsetAttribute.startOffset() - 1) != '"'
            || origQuery.charAt(offsetAttribute.endOffset()) != '"') {
        // make a phrase out of the multi-word synonym
        termToAdd = new StringBuilder(termToAdd).insert(0, '"').append('"').toString();
    }
}

If the tokenized original query differs from the keyword in the synonym file, this query parser doesn't build the synonym search phrase.

When I don't use KeywordTokenizer or WhitespaceTokenizer, the original query sometimes differs from the tokenized (parsed) query.
But if the tokenized original query differs from the keyword in the synonym file, this query parser doesn't find synonyms or build synonym search phrases.
I think the parser should check both the tokenized original query and the raw original query when looking up synonyms.

# tokenizer : StandardTokenizer
# synonyms.txt
血と骨,Blood and Bones
wi-fi, wifi
wifi ==> OK
http://10.141.15.112:8983/solr/select?qf=Title_t&q=wifi&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=0
+((Title_t:wifi)^1.1 (((+(Title_t:"wi fi")) (+(Title_t:wifi)))^0.9))

wi-fi ==> can't find synonyms
http://10.141.15.112:8983/solr/select?qf=Title_t&q=wi-fi&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=0
+(((Title_t:wi Title_t:fi)~2))
Blood and Bones => OK
http://10.141.15.112:8983/solr/select?qf=Title_t&q=Blood%20and%20Bones&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=0
+(((+(Title_t:blood) +(Title_t:bones))^1.1) (((+(Title_t:"blood and bones")) (+(Title_t:"血 と 骨")))^0.9))

血と骨 ==> can't find synonyms
http://10.141.15.112:8983/solr/select?qf=Title_t&q=%E8%A1%80%E3%81%A8%E9%AA%A8&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=0
+(((Title_t:血 Title_t:と Title_t:骨)~3)

血 と 骨 ==> can't find synonyms
http://10.141.15.112:8983/solr/select?qf=Title_t&q=%E8%A1%80%20%E3%81%A8%20%E9%AA%A8&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=0
+(((Title_t:血) (Title_t:と) (Title_t:骨))~3)
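
For illustration, here is a minimal standalone sketch of why the raw entry wi-fi never matches. It uses the stock Lucene StandardAnalyzer (assumed here as representative of a StandardTokenizer-based chain in a recent Lucene version; it is not the plugin's own code): the tokenizer splits wi-fi into wi and fi before any synonym lookup can happen, so an entry keyed on the untokenized text has nothing to match.

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeWifi {
    public static void main(String[] args) throws IOException {
        // StandardAnalyzer uses StandardTokenizer under the hood
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("Title_t", "wi-fi")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "wi", then "fi"
            }
            ts.end();
        }
    }
}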

Distribute on maven?

Any reason not to push releases to maven? We could configure travis-ci to push on tagged builds to make releases easier.

Multi-word synonyms not being expanded as expected when there are extra terms in the query

For example, let's take the following synonym definition:

back pack=>backpack

If I search for back pack, the parser expands the query as expected:

+((((text:back) (text:pack))~2) (((+(text:backpack^1.2)))^1.2))

However, if I use another word in my query, it doesn't:

+(((text:north) (text:face) (text:back) (text:pack))~4)

I have a fix that does what I'm expecting:

+((((text:north) (text:face) (text:back) (text:pack))~4) (((+(((text:north) (text:face) (text:backpack))~3)))^1.2))

Or "north face back pack red" returns

+((((text:north) (text:face) (text:back) (text:pack) (text:red))~5) (((+(((text:north) (text:face) (text:backpack) (text:red))~4)))^1.2))
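
The fix itself is not shown here; as a conceptual illustration only, the splicing it amounts to is: replace just the span of the original query that matched the multi-word synonym and keep the surrounding terms, instead of dropping them. The offsets below are computed by hand for the example; in the plugin they would come from the token stream's offset attribute.

public class SynonymSplice {

    // Replace the character span [start, end) of the original query with the synonym,
    // leaving every other term untouched.
    static String splice(String originalQuery, int start, int end, String synonym) {
        return originalQuery.substring(0, start) + synonym + originalQuery.substring(end);
    }

    public static void main(String[] args) {
        String q = "north face back pack red";
        int start = q.indexOf("back pack");
        int end = start + "back pack".length();
        System.out.println(splice(q, start, end, "backpack")); // north face backpack red
    }
}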

Problem with non-latin characters in synonyms.txt

Thank you Nolan for making this really cool plugin for better synonym handling. I installed it on my Solr 4.0 and had no problems getting it running.

However, I seem to have stumbled across a problem with non-latin characters. I'm not completely sure that my problem is due to non-latin characters in the synonyms file, but I believe that my tests, as described below, indicate it:

The contents of my synonyms.txt file (saved in UTF-8):

  brystforstørrelse, brystoperation, silikonebryster, bfo, brystforstørrende operation
  mop, abdominalplastik, maveplastik, maveopstramning

Notice that the first line of the synonym file contains some terms with the unicode character ø. The second line contains no special characters.

Upon testing the real-life scenario, where somebody searches for bfo (a common abbreviation), I get this parsed result (notice how some of the tokens are missing a few characters near the end):

+((title:bfo) ((((title:brystforstør) (title:operation))~2) (title:brystoperation) (title:silikonebryst) (title:brystforstør))) () ()

To test my suspicion that this has something to do with UTF-8 and character positions/offsets, I tried searching for mop, which invokes the second line of the synonym file.

This search returned this parsed result, with all words having the correct length, which further strengthens my suspicion:

+((title:mop) ((title:abdominalplastik) (title:maveopstramning) (title:maveplastik))) () ()

Could there be something to this, or am I doing something else wrong?

Thanks!
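
As a quick sanity check of the suspected byte-versus-character offset mix-up (this is plain Java, nothing specific to the plugin): ø is one char but two bytes in UTF-8, so an offset computed on bytes and then applied to a Java String will cut the token short, exactly as in the truncated brystforstør tokens above.

import java.nio.charset.StandardCharsets;

public class OffsetCheck {
    public static void main(String[] args) {
        String term = "brystforstørrelse";
        System.out.println(term.length());                                // 17 characters
        System.out.println(term.getBytes(StandardCharsets.UTF_8).length); // 18 bytes: "ø" takes 2
    }
}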

Compatibility with Solr 6.0.0?

Hi, do you know if this plugin is compatible with Solr 6.0.0?

This is the error message I'm receiving from solr:
java.lang.NoSuchMethodError: org.apache.lucene.search.BooleanQuery.getClauses()[Lorg/apache/lucene/search/BooleanClause;

The more extended trace pulled from the web ui logging tab I've added here:
https://gist.github.com/Mykezero/853a714e2840129d82231e4546047b05

I've tried using the 2.0.0 and, I think, the 1.3.5-4.3 version of the jar with the same result.
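
The NoSuchMethodError points at BooleanQuery.getClauses(), which is gone from the Lucene version shipped with Solr 6; the surviving accessor is clauses(), which returns a List rather than an array. The sketch below only illustrates the API difference a rebuild against Solr 6 would have to account for; it is not a statement about how the plugin actually resolves this.

import java.util.List;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;

public class ClauseAccess {
    // Pre-Lucene 6 code could call: BooleanClause[] clauses = query.getClauses();
    // Against Lucene 6, use clauses() and work with the List directly.
    static List<BooleanClause> clausesOf(BooleanQuery query) {
        return query.clauses();
    }
}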

For a query containing "-", synonym_edismax can't find synonyms.

For a query containing "-", synonym_edismax doesn't find synonyms even though I set WhitespaceTokenizer as both the query and synonym tokenizer.

tokenizer : WhitespaceTokenizer

# synonyms.txt
wi-fi,wifi

# query : wifi ==> OK
http://127.0.0.1:8983/solr/select?qf=Title_t&q=wifi&pf=Title_t&ps=0&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=1
+((Title_t:wifi)^1.1 ((+(Title_t:wi-fi) +(Title_t:wifi))^0.9))

# query : wi-fi ==> can't find synonyms.
http://127.0.0.1:8983/solr/select?qf=Title_t&q=wi-fi&pf=Title_t&ps=0&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=1
+(Title_t:wi-fi)
# synonyms.txt
e-learning,elearning

# query : elearning ==> OK
http://127.0.0.1:8983/solr/select?qf=Title_t&q=elearning&pf=Title_t&ps=0&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=1
+((Title_t:elearning)^1.1 ((+(Title_t:e-learning) +(Title_t:elearning))^0.9)

# query : e-learning ==>  can't find synonyms
http://127.0.0.1:8983/solr/select?qf=Title_t&q=e-learning&pf=Title_t&ps=0&defType=synonym_edismax&synonyms=true&debugQuery=true&q.op=AND&synonyms.constructPhrases=true&synonyms.originalBoost=1.1&synonyms.synonymBoost=0.9&rows=1
+(Title_t:e-learning)

Nullpointer at empty query string

getString() is called several times, but it may return null, causing a NullPointerException. I saw this during warm-up queries using *:*, which is a match-all query with an empty query string.

I have a fix
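
A minimal sketch of the kind of guard such a fix would presumably add (the method and its placement are assumptions, not the actual patch): treat a null or blank query string as "nothing to expand" and leave the query untouched.

public class EmptyQueryGuard {
    // True only when there is actually something to run synonym expansion on.
    static boolean shouldExpand(String rawQuery) {
        return rawQuery != null && !rawQuery.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldExpand(null));   // false -> skip expansion, avoid the NPE
        System.out.println(shouldExpand("   "));  // false
        System.out.println(shouldExpand("wifi")); // true
    }
}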

TooManyClauses: maxClauseCount is set to 1024

After setting up a decent number of synonyms, multi-word queries start to throw the error in the subject line.

Has anyone experienced similar issues? Is increasing maxClauseCount the only way to deal with it?
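
Raising the limit is possible, either via the maxBooleanClauses setting in solrconfig.xml or programmatically as in the sketch below (the static setter exists in the Lucene 4.x-7.x line; newer versions moved the setting elsewhere), but that only raises the ceiling. Reducing how many clauses the expansion produces (fewer or shorter synonym alternatives) is the more sustainable fix.

import org.apache.lucene.search.BooleanQuery;

public class ClauseLimit {
    public static void main(String[] args) {
        // The default is 1024; this setter is global (JVM-wide) in the Lucene 4.x-7.x line.
        BooleanQuery.setMaxClauseCount(4096);
        System.out.println(BooleanQuery.getMaxClauseCount());
    }
}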

Synonym expansion issue when a (first) term appears more than once in a synonym entry

There seems to be an expansion issue in the existing synonym parser with the synonym below; please see the debug response that follows.

query :

http://localhost:8585/solr/select?q=fooaaa+bar&rows=0&wt=json&indent=true&debugQuery=true&defType=synonym_edismax&synonyms=true

Synonym : fooaaa, fooaaa bar

"debug": {
"rawquerystring": "fooaaa bar",
"querystring": "fooaaa bar",
"parsedquery": "(+((DisjunctionMaxQuery((textSpell:fooaaa)) DisjunctionMaxQuery((textSpell:bar))) (((+(DisjunctionMaxQuery((textSpell:fooaaa)) DisjunctionMaxQuery((textSpell:bar)) DisjunctionMaxQuery((textSpell:bar))))/no_coord)) (((+DisjunctionMaxQuery((textSpell:fooaaa)))/no_coord))))/no_coord",
"parsedquery_toString": "+(((textSpell:fooaaa) (textSpell:bar)) ((+((textSpell:fooaaa) (textSpell:bar) (textSpell:bar))) ((+(textSpell:fooaaa))))",
"explain": {},
"queryToHighlight": [
"org.apache.lucene.search.BooleanClause:(textSpell:fooaaa) (textSpell:bar)",
"org.apache.lucene.search.BooleanClause:(+((textSpell:fooaaa) (textSpell:bar) (textSpell:bar)))",
"org.apache.lucene.search.BooleanClause:(+(textSpell:fooaaa))"
],
"expandedSynonyms": [
"fooaaa",
"fooaaa bar",
"fooaaa bar bar"
].....
Let me know if you need more information on this issue.

Customize COMPLEX_QUERY_OPERATORS_PATTERN in query config

It would be nice to give users the ability to override COMPLEX_QUERY_OPERATORS_PATTERN from the query configuration.

In my case I had to spend quite a bit of time understanding why queries containing a dash are not expanded, e.g. vendor-managed is not expanded because the dash character is in COMPLEX_QUERY_OPERATORS_PATTERN.

I would like to be able to customize it from configuration instead of changing the source code.

Regards,
Yegor
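
For illustration, a hypothetical sketch of the kind of guard being described: if the raw query matches a "complex operators" pattern, synonym expansion is skipped, and making the pattern a parameter is what would allow it to be supplied from configuration. The regex below is illustrative only and is not the plugin's actual COMPLEX_QUERY_OPERATORS_PATTERN.

import java.util.regex.Pattern;

public class ComplexQueryGuard {
    // Illustrative default only; the real pattern in the plugin may differ.
    static final Pattern DEFAULT =
            Pattern.compile("[-+!(){}\\[\\]^\"~*?:\\\\]|&&|\\|\\|");

    // If the pattern were configurable, a query like "vendor-managed" could be
    // allowed to expand simply by supplying a pattern without the dash.
    static boolean skipExpansion(String rawQuery, Pattern complexOperators) {
        return complexOperators.matcher(rawQuery).find();
    }

    public static void main(String[] args) {
        System.out.println(skipExpansion("vendor-managed", DEFAULT)); // true  -> no expansion
        System.out.println(skipExpansion("vendor managed", DEFAULT)); // false -> expansion proceeds
    }
}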

Adding .jar to $SOLR_HOME generates class not found exception?

Hi - and thanks for the work on this great plugin.

I'm trying to use it with Solr 5.4, and I saw the instruction that says:

Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and it works (Jetty).

So, I took the defaults on the "for production" Solr install script and inside /etc/defaults/solr.in.sh my $SOLR_HOME is set to /var/solr/data. Therefore I added a /lib directory inside /data and dropped the hon-lucene-synonyms-2.0.0.jar file into /var/solr/data/lib -- then I restarted Solr.

Unfortunately I'm getting a class not found exception in the Solr logs...

Caused by: java.lang.ClassNotFoundException: com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin

Where should I put this file for it to work with SolrCloud? Thanks in advance.

Allow multiple synonym analyzers for multiple fields in a single query

I have one more question about synonym_edismax.
We set the field types in schema.xml as shown below, and when we index/query each field, Solr works fine with those settings.

But for synonym_edismax, I can set only one tokenizer. How should I do a synonym search for each field when the fields have different types (are tokenized differently)?
I want to use solr.WhitespaceTokenizerFactory for the field "tag" and solr.StandardTokenizerFactory for the field "title". How can I do that?


...

Matches all docs if bf (Boost Function) present

If your edismax query involves the bf parameter, and the query matches a synonym, this plugin seems to generate a query which always returns all documents. Simply adding e.g. bf=last_modified to any of the unit tests will reveal the bug. The constructed parsed query looks something like this (assuming the query restskatt is expanded with the synonym baksmell, and bf=SearchRank^10):

(
  DisjunctionMaxQuery(((PageName:restskatt)^30.0))^1.0 
  ((+DisjunctionMaxQuery(((PageName:baksmell)^30.0)))/no_coord^1.0)
) 
FunctionQuery(float(SearchRank))^10.0

Investigate generalization

I'm thinking that perhaps the code could be modularized into an advanced synonyms API for Lucene. By that I mean a lucene-synonyms-core module which only depends on Lucene classes such as query parsers, etc. It would take a config object and the query (class) as input, and spit out the modified query.

I have not investigated the code to see how difficult it would be, but if this is possible, then the same core could be used for plain Lucene applications as well as an ElasticSearch plugin and a Solr plugin. The plugin layer would only provide the config parsing and glue code: for Solr you'd fetch the synonym Analyzer from the schema's FieldTypes, for ElasticSearch the Analyzer would be fetched from the mappings, and for Lucene it would be provided directly by the application.

Synonyms only work on search terms in first position

Example using ingredient synonyms:
I have a synonym set up as: Swedish turnips, rutabagas

Searching for 'Swedish turnips' will give me results that contain rutabagas.
Searching for 'healthy Swedish turnips' does not. In this case the synonym is not expanded or used in the query; I have verified this by looking at the debug info.

Is this not supported or does it require some kind of config change?
