
Comments (5)

nolanlawson commented on August 19, 2024

Unfortunately, it appears that hyphenated synonyms like "e-commerce" can only be used if the synonym file is explicitly purged of all hyphens (which should be replaced with spaces). Commit 4913b62 demonstrates this: the synonym file contains the line "e commerce,electronic commerce", and queries like "e-commerce" work as expected.

My impression is that this is a weakness of the default configuration we choose for the synonym analyzer. The combination of KeywordTokenizers at one step and a StandardTokenizer at the other causes hyphenated synonyms to be overlooked.

Unfortunately I can't seem to find a combination that satisfies all the unit tests, so for now I'm just recommending that people manually purge their synonym files of hyphenated synonyms.
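
The manual purge recommended above could be scripted along these lines. This is only a minimal sketch: `purge_hyphens` is a hypothetical helper (not part of the plugin), and it assumes the usual Solr synonym file layout of comma-separated entries with `#` comment lines.

```python
def purge_hyphens(line: str) -> str:
    """Replace hyphens in a synonym line with spaces.

    Blank lines and '#' comment lines are returned unchanged.
    Explicit mappings ("a => b") are unaffected because the arrow
    contains no hyphen.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line
    return line.replace("-", " ")


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the project's synonym file.
    for sample in ["e-commerce,electronic commerce", "# e-commerce note"]:
        print(purge_hyphens(sample))
```

Run over each line of a synonym file, this turns `e-commerce,electronic commerce` into `e commerce,electronic commerce`, which is the form the commit mentioned above uses.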

from hon-lucene-synonyms.

avlukanin commented on August 19, 2024

I use WhitespaceTokenizerFactory, not StandardTokenizerFactory, in the configuration; that's why the hyphen in e-commerce is preserved. StandardTokenizerFactory is set further on, in my schema.xml.


janhoy commented on August 19, 2024

How about inserting a PatternReplaceCharFilterFactory before the tokenizer to remove hyphens?

On 17 Oct 2013, at 08:08, Artem Lukanin wrote:

I use WhitespaceTokenizerFactory, not StandardTokenizerFactory in the configuration, that's why the hyphen in e-commerce is preserved. StandardTokenizerFactory is set further in my schema.xml.



avlukanin commented on August 19, 2024

As far as I know, you cannot use CharFilters in edismax_synonyms in solrconfig.xml. It would of course be a workaround in some cases, but there are situations where hyphens are perfectly valid characters and you don't want to lose them. For example, if you work with phone numbers like 234-45-56 and use WhitespaceTokenizer -> WordDelimiterFilter with catenateNumbers="1" to convert them into 2344556, you will not get complete phone numbers, only their parts.
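
To make the phone-number case concrete, here is a toy Python imitation of the chain described above. It is only a sketch under stated assumptions: `catenate_numbers` and `analyze` are hypothetical helpers, not Solr APIs; the real WordDelimiterFilter does considerably more (by default it also emits the individual number parts); and the second call assumes a char filter that turns hyphens into spaces before tokenizing.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace only, so hyphens stay inside tokens."""
    return text.split()


def catenate_numbers(token: str) -> str:
    """Toy stand-in for catenateNumbers="1": if a token consists of
    digit groups separated by non-alphanumeric characters, join them."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if len(parts) > 1 and all(p.isdigit() for p in parts):
        return "".join(parts)
    return token


def analyze(text: str) -> list[str]:
    """Whitespace tokenization followed by number catenation."""
    return [catenate_numbers(t) for t in whitespace_tokenize(text)]


if __name__ == "__main__":
    print(analyze("call 234-45-56"))   # hyphens preserved
    print(analyze("call 234 45 56"))   # hyphens already replaced by spaces
```

With the hyphens intact the number survives as the single token `2344556`; once they have been replaced by spaces, the tokenizer has already split the number and only `234`, `45` and `56` remain.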


nolanlawson commented on August 19, 2024

While investigating #26 and #9, it occurred to me that all of these issues are related. I also think they're really just configuration issues, related to the fact that, in our examples and unit tests, we configure the synonym analyzer to use the StandardTokenizer, which tokenizes UTF-8 and hyphenated synonyms in an unintuitive way (for most folks):

血と骨 -> 血 と 骨 (3 tokens)
e-commerce -> e commerce (2 tokens)

My fix was just to replace the StandardTokenizer with the WhitespaceTokenizer. All the old unit tests still pass, and as a bonus we fully fix #32 and #9, so people don't have to manually replace hyphens with spaces anymore.

Hopefully the WhitespaceTokenizer will work better for most cases. It messes with what gets considered a "shingle" (e.g. 血と骨 becomes one big unigram) once you get to the ShingleFilterFactory, but it seems to fit better with people's expectations of how the synonym expansion should work.
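
The difference between the two tokenizers can be imitated in plain Python. This is a rough sketch, not Lucene code: `standard_like_tokenize` is a hypothetical approximation that only models the two behaviours mentioned above (splitting at hyphens, and emitting Han and Hiragana characters one at a time), while `whitespace_tokenize` simply splits on whitespace.

```python
import re

# Han and Hiragana characters become single-character tokens;
# any other run of word characters becomes one token.
# This is an assumption-laden simplification of StandardTokenizer.
_STANDARD_LIKE = re.compile(
    r"[\u3040-\u309f\u4e00-\u9fff]"
    r"|[^\W\u3040-\u309f\u4e00-\u9fff]+"
)


def standard_like_tokenize(text: str) -> list[str]:
    """Approximate the splitting behaviour described for StandardTokenizer."""
    return _STANDARD_LIKE.findall(text)


def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only; hyphens and CJK runs stay intact."""
    return text.split()


if __name__ == "__main__":
    for sample in ["血と骨", "e-commerce"]:
        print(sample, standard_like_tokenize(sample), whitespace_tokenize(sample))
```

Under these assumptions the standard-like tokenizer yields three tokens for 血と骨 and two for e-commerce, whereas whitespace tokenization keeps each as a single token, which is why a synonym entry written with the hyphen can match again.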

BTW, I also put all this configuration into a single file, so it's easier to modify. The same file that's used for the unit tests is referenced in the README; we can change that later if it becomes awkward.
