Comments (9)
Hi,
- The match score for each element looks at unique words: the number of common words divided by the number of all unique words. In the case of 'Match' and 'Match Rain' we have 2 unique words with 1 in common, so the result is 0.5. The default threshold requires a score greater than 0.5 for a document to be considered a match, so this can be fixed by changing the threshold on the document to 0.49.
- The weight distribution can be applied to each element of a document, but it needs to be consistent across all documents. It's easier to generate the Document object in a separate utility so that it stays consistent.
Here is an example of your code, refactored to make it easier. I just converted FuzzyTitle to String[]; you can convert it back.
private Document generateDocs(String[] stringDoc) {
    // stringDoc[0] is the document key, stringDoc[1] the title to compare
    Document document = new Document.Builder(stringDoc[0])
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[1])
                    .setWeight(.70D).createElement())
            .setThreshold(0.49)
            .createDocument();
    return document;
}

String[] docSearch = {"1", "Match", "Max Studio", "Produced"};
String[] docAvailable1 = {"2", "Match", "Dream Work", "Produced"};
String[] docAvailable2 = {"3", "Match Rain", "Dream Work", "Released"};
Document searchDocument = generateDocs(docSearch);
Document searchDocAvailable1 = generateDocs(docAvailable1);
Document searchDocAvailable2 = generateDocs(docAvailable2);
List<Document> documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);
MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);
- Now that you have a separate utility that is applied to all documents, you can add weights to each element and keep them consistent across all documents.
Here is an example:
private Document generateDocs(String[] stringDoc) {
    Document document = new Document.Builder(stringDoc[0])
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[1])
                    .setWeight(.70D).createElement())
            .addElement(new Element.Builder().setType(ElementType.NAME)
                    .setValue(stringDoc[2])
                    .setWeight(.20D).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[3])
                    .setWeight(.10D).createElement())
            .setThreshold(0.49)
            .createDocument();
    return document;
}
Now that multiple elements with different weights are being used, the scores and matches will be different, but you can change the threshold and see what works best.
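To make the word-based score from the first point concrete, here is a minimal standalone sketch in plain Java. It does not call fuzzy-matcher; the class name WordOverlap and the whitespace/lower-case splitting are assumptions for illustration only, and the library's real tokenizers may normalise values differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of fuzzy-matcher.
class WordOverlap {

    // Assumption: words are separated by whitespace and compared case-insensitively.
    static Set<String> words(String value) {
        return new HashSet<>(Arrays.asList(value.trim().toLowerCase().split("\\s+")));
    }

    // Score = common unique words / all unique words across both values.
    static double score(String left, String right) {
        Set<String> common = words(left);
        common.retainAll(words(right));
        Set<String> all = words(left);
        all.addAll(words(right));
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        double s = score("Match", "Match Rain");
        System.out.println(s);         // 0.5
        System.out.println(s > 0.5);   // false: fails a "greater than 0.5" threshold
        System.out.println(s > 0.49);  // true: passes once the threshold is 0.49
    }
}
```

With 1 common word out of 2 unique words the score is exactly 0.5, which is why a threshold of 0.5 rejects the pair and 0.49 accepts it.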
Hope this helps
from fuzzy-matcher.
Dear Manish,
Thank you for your kind response.
Excellent! You are right, I was missing the threshold setting. I must say, this is a great tool for calculating scores when comparing documents.
I observed a few things when I tested with different data and cases. I would be thankful if you could help me with your expertise.
- Comparing 'Stop' with 'Stop(2002)' or 'Stops' should return some score. I tested using setType NAME and TEXT. In both cases it returns a score of 0. What changes should I make to get a score?
- For comparing numbers, I tested with '1234' and '12345'. It gives zero matches. I tried with setType NUMBER. How can I improve the result in this case?
- In the case of multiple parameters with weighted values, I used the suggested code.
private Document generateFuzzyDocs(FuzzyTitle TitleDoc) {
    Document document = new Document.Builder(TitleDoc.getCounter())
            .addElement(new Element.Builder().setType(ElementType.NAME)
                    .setValue(TitleDoc.getTitle().toString())
                    .setWeight(.70D).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(TitleDoc.getDistributor())
                    .setWeight(.20D).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(TitleDoc.getContent())
                    .setWeight(.10D).createElement())
            .setThreshold(0.00)
            .createDocument();
    return document;
}
The weights are given as .70D, .20D and .10D. In the first comparison below, the first 2 parameters match exactly and the 3rd parameter matches partially, yet the score is only 47%. Ideally it should be more than 90%. In the second comparison, the most heavily weighted parameter matches only partially, yet the score is 65%, which is higher than in the first case. Am I missing something here?
Match : {{[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'Spider'}, {'Columbia Pictures'}, {'AdamJones'}]}, score=0.47368421052631576}]}
Match : {{[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'The Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, score=0.65}]}
- I am yet to test setType DATE. What process should I follow when comparing dates such as the following to get proper score values?
'22/05/2002', '22/06/2002', '22/05/2020', '08/04/2014'
- For alphanumeric value comparison, which setType is preferred: NAME, ADDRESS or TEXT?
Thank you very much in advance.
from fuzzy-matcher.
- The default mechanism to tokenize elements of type TEXT and NAME looks at whole words, which requires them to be separated by spaces. For the elements you described, which have no well-defined separators but are sequences of characters, an N-Gram tokenizer is more appropriate. You can override the default in a TEXT element with the predefined trigram tokenizer: .setTokenizerFunction(TokenizerFunction.triGramTokenizer())
- Matching the numbers '1234' and '12345' using the NUMBER type expects numerical values that are close to each other, so you will find 1240 to be a better match for 1234 than 12345. It seems like you want the numerical characters to match instead of the actual value. In that case the TEXT type will be a better fit, and you can override it with the N-Gram tokenizer as above, since these are not words but sequences of characters.
- I apologise for not calling this out in my earlier post: when working with weights, the default is 1.0 for all elements. To give a higher weight to a few elements, we need to provide a number greater than 1. So leave the least significant element at 1.0 and increase the weights of the other elements. Try 4.0 on the first, 2.0 on the second and 1.0 on the third; that should probably give the result you were looking for.
- Dates work similarly to numerical values: they use a nearest-neighbor match, where dates close to each other match better than ones further apart. You can play with the .setNeighborhoodRange parameter while creating an element. It takes values from 0 to 1.0, where a higher value will match dates closer to each other.
- For alphanumeric values TEXT is again preferred; the big difference is in the choice of tokenizer. If you do not want each word to be compared independently, then override the TokenizerFunction.
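As a rough illustration of why a trigram tokenizer helps with values like 'Stop' vs 'Stops' or '1234' vs '12345', here is a standalone sketch in plain Java. It does not use fuzzy-matcher; the class TrigramOverlap and its scoring rule (shared trigrams divided by all distinct trigrams) are assumptions for illustration, and the library's own trigram tokenizer and scoring may differ in detail.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of fuzzy-matcher.
class TrigramOverlap {

    // Every run of three consecutive characters; values shorter than 3 are kept whole.
    static Set<String> trigrams(String value) {
        Set<String> grams = new HashSet<>();
        String v = value.toLowerCase();
        if (v.length() < 3) {
            grams.add(v);
            return grams;
        }
        for (int i = 0; i + 3 <= v.length(); i++) {
            grams.add(v.substring(i, i + 3));
        }
        return grams;
    }

    // Score = shared trigrams / all distinct trigrams across both values.
    static double score(String left, String right) {
        Set<String> common = trigrams(left);
        common.retainAll(trigrams(right));
        Set<String> all = trigrams(left);
        all.addAll(trigrams(right));
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        // "stop" -> {sto, top}; "stops" -> {sto, top, ops}: 2 of 3 trigrams shared
        System.out.println(score("Stop", "Stops"));
        // "1234" -> {123, 234}; "12345" -> {123, 234, 345}: 2 of 3 trigrams shared
        System.out.println(score("1234", "12345"));
    }
}
```

As whole words, 'Stop' and 'Stops' share nothing, so a word-based score is 0; as trigrams they share most of their tokens, which gives a partial score.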
from fuzzy-matcher.
Dear Manish,
I applied triGramTokenizer() and the weighting concept as you suggested, and the results are coming out as expected :).
For TEXT and NAME type comparison the process is absolutely clear. Thanks a lot for your awesome guidance.
I tried to apply .setNeighborhoodRange with number values as suggested, but the result is always either 1 or 0. It does not come out as a matching percentage as it does for TEXT and NAME. I am using the code below.
Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();
I also tried the setup below, but with no luck.
Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setMatchType(MatchType.EQUALITY)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();
I have not tried a Date field yet. Hopefully, once I am clear on the NUMBER type, the DATE type should be similar.
Thank you very much in advance.
from fuzzy-matcher.
Dear Manish,
In addition to my previous comment, I also tried adding valueTokenizer as below, but did not succeed. It still gives 1 or 0, with no matching percentage in between.
Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setTokenizerFunction(TokenizerFunction.valueTokenizer())
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();
return document;
I also tried to work with a Date field. I am trying to set the value with the code below, but it raises this exception at run time:
com.intuit.fuzzymatcher.exception.MatchException: Unsupported data type.
Date d = new Date();
try {
    d = new SimpleDateFormat("dd/MM/yyyy").parse(StringDoc[1]);
} catch (Exception ex) {
}
Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.DATE)
                .setValue(d)
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setTokenizerFunction(TokenizerFunction.valueTokenizer())
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();
I would appreciate your help with Number and Date type comparison.
Thank you very much.
from fuzzy-matcher.
The score for NUMBER and DATE is expected to be either 1 or 0. Scoring in other elements depends on tokenization: each element is broken down into smaller tokens, and the percentage of matching tokens is calculated.
Since there is no way to break down these numerical and date values, you will always get a score of 1 if they are within the neighbourhood range, and 0 otherwise.
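A minimal sketch of why such a score is binary, in plain Java. This is not the library's implementation: the class NeighbourhoodScore and the rule used here (the ratio of the smaller to the larger value must be at least the range) are assumptions chosen only to illustrate the idea of an in-range or out-of-range check on a single, untokenized value.

```java
// Hypothetical helper, not part of fuzzy-matcher.
class NeighbourhoodScore {

    // Assumed rule for non-negative values: smaller / larger >= range.
    static boolean withinRange(double a, double b, double range) {
        double small = Math.min(Math.abs(a), Math.abs(b));
        double large = Math.max(Math.abs(a), Math.abs(b));
        if (large == 0.0) {
            return true; // both values are zero, so they are identical
        }
        return small / large >= range;
    }

    // A single value has no tokens to count, so the outcome is all or nothing.
    static double score(double a, double b, double range) {
        return withinRange(a, b, range) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(score(1234, 1240, 0.99));  // 1.0: 1234/1240 is about 0.995
        System.out.println(score(1234, 12345, 0.99)); // 0.0: 1234/12345 is about 0.1
    }
}
```

Under this assumed rule, changing the range only moves the cut-off between 1 and 0; it never produces an in-between percentage.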
The Unsupported data type exception occurs when the value you are passing is not a valid date. I am not able to reproduce it with the code you have. My only suspicion is that the date string is not a valid Date type, and since the exception is silenced by the empty catch block, that failure is going unnoticed.
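One way to surface such a failure instead of hiding it is to parse strictly and rethrow. The sketch below uses only standard JDK classes; the class StrictDates and the method parseStrict are hypothetical names, and the dd/MM/yyyy pattern is taken from the code posted above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper, not part of fuzzy-matcher.
class StrictDates {

    // Parses a dd/MM/yyyy string and fails loudly instead of swallowing the error.
    static Date parseStrict(String text) {
        SimpleDateFormat format = new SimpleDateFormat("dd/MM/yyyy");
        format.setLenient(false); // reject impossible dates such as 31/02/2002
        try {
            return format.parse(text);
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Not a dd/MM/yyyy date: " + text, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("22/05/2002") != null); // true
        try {
            parseStrict("31/02/2002");
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage()); // Not a dd/MM/yyyy date: 31/02/2002
        }
    }
}
```

The parsed value can then be passed to .setValue(...) as in the original code; if the input is malformed, the exception points at the offending string rather than at the matcher.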
from fuzzy-matcher.
Dear Manish,
Got it. Thank you for your help and guidance. I implemented the concepts and the results are mostly as expected :).
I observed one case, described below.
Suppose I have a name, an address and a phone number to match, and I prepare 2 documents with a weight for each section. Let's assume that in the 1st document the name is 'Rohan' and in the 2nd document the address is 'Sarabai Rohan Street'. Here 'Rohan' is common to both, and this raises the matching score. Ideally, name should be compared only with name and address only with address.
Any suggestions for this scenario?
Thank you.
from fuzzy-matcher.
Closing this issue for now. Feel free to reopen it if you feel it is still unresolved.
from fuzzy-matcher.