
Comments (9)

manishobhatia commented on August 16, 2024

Hi,

  1. The score for each element is computed from unique words: the number of words the two values have in common, divided by the number of unique words across both. For "Match" and "Match Rain" we have 2 unique words with 1 in common, so the result is 0.5. By default, a document only counts as a match when its score is greater than 0.5, so this pair is excluded. This can be fixed by lowering the document threshold to 0.49.
  2. Weights can be applied to each element of a document, but they need to be consistent across all documents. It's easier to build the Document objects in a separate utility method so that they stay consistent.
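
As a rough illustration of the scoring described in point 1, here is a minimal self-contained sketch. It is not fuzzy-matcher's actual implementation: the class name WordOverlapSketch and the whitespace/lower-case splitting are assumptions made only for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordOverlapSketch {
    // Splits on whitespace and lower-cases the words (an illustrative choice,
    // not necessarily what the library's tokenizer does).
    static Set<String> uniqueWords(String value) {
        return new HashSet<>(Arrays.asList(value.trim().toLowerCase().split("\\s+")));
    }

    // Common words divided by all unique words across both values.
    static double score(String left, String right) {
        Set<String> common = uniqueWords(left);
        common.retainAll(uniqueWords(right));
        Set<String> all = uniqueWords(left);
        all.addAll(uniqueWords(right));
        return all.isEmpty() ? 0.0 : (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        double score = score("Match", "Match Rain");
        System.out.println(score);        // 1 common word out of 2 unique words: 0.5
        System.out.println(score > 0.5);  // false: fails a "greater than 0.5" threshold
        System.out.println(score > 0.49); // true: passes once the threshold is 0.49
    }
}
```

The sketch shows why a score of exactly 0.5 is rejected by a strict "greater than 0.5" check and accepted once the threshold is lowered slightly.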

Here is your code, refactored to make that easier. I converted FuzzyTitle to String[]; you can convert it back.

private Document generateDocs(String[] stringDoc) {
    // stringDoc[0] is the document key, stringDoc[1] is the title to match on
    return new Document.Builder(stringDoc[0])
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[1])
                    .setWeight(.70D).createElement())
            .setThreshold(0.49)
            .createDocument();
}

String[] docSearch = {"1", "Match", "Max Studio", "Produced"};
String[] docAvailable1 = {"2", "Match", "Dream Work", "Produced"};
String[] docAvailable2 = {"3", "Match Rain", "Dream Work", "Released"};

Document searchDocument = generateDocs(docSearch);

Document searchDocAvailable1 = generateDocs(docAvailable1);

Document searchDocAvailable2 = generateDocs(docAvailable2);

List<Document> documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);

MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);
  3. Now that you have a separate utility that is applied to every document, you can add weights to each element and keep them consistent across all documents.

Here is an example:

private Document generateDocs(String[] stringDoc) {
    // One element per field, each with its own weight
    return new Document.Builder(stringDoc[0])
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[1])
                    .setWeight(.70D).createElement())
            .addElement(new Element.Builder().setType(ElementType.NAME)
                    .setValue(stringDoc[2])
                    .setWeight(.20D).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT)
                    .setValue(stringDoc[3])
                    .setWeight(.10D).createElement())
            .setThreshold(0.49)
            .createDocument();
}

Note that once multiple elements with different weights are in use, the scores and matches will change. You can adjust the threshold to see what works best.

Hope this helps.

from fuzzy-matcher.

deepak-Jenkins commented on August 16, 2024

Dear Manish,

Thank you for your kind response.

Excellent! You are right, I was missing the threshold setting. I must say, this is a great tool for calculating scores when comparing records.

I observed a few things when I tested with different data and cases. I would be thankful if you could help me with your expertise.

  1. Comparing 'Stop' with 'Stop(2002)' or 'Stops' should return some score. I tested with setType NAME and TEXT, and in both cases the score is 0. What changes should I make to get a score?

  2. For comparing numbers, I tested with '1234' and '12345' using setType NUMBER, and it reports no match. How can I improve the result in this case?

  3. For multiple parameters with weighted values, I used the suggested code:

    private Document generateFuzzyDocs(FuzzyTitle titleDoc) {
        return new Document.Builder(titleDoc.getCounter())
                .addElement(new Element.Builder().setType(ElementType.NAME)
                        .setValue(titleDoc.getTitle().toString())
                        .setWeight(.70D).createElement())
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(titleDoc.getDistributor())
                        .setWeight(.20D).createElement())
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(titleDoc.getContent())
                        .setWeight(.10D).createElement())
                .setThreshold(0.00)
                .createDocument();
    }

The weights are .70D, .20D and .10D. In the first comparison below, the first two parameters match exactly and the third matches partially, yet the score is only 47%. Ideally it should be more than 90%. In the second comparison, the most heavily weighted parameter matches only partially, yet the score is 65%, which is higher than in the first case. Am I missing something here?

Match : {{[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'Spider'}, {'Columbia Pictures'}, {'AdamJones'}]}, score=0.47368421052631576}]}

Match : {{[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'The Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, score=0.65}]}

  4. I have yet to test setType DATE. What process should I follow when comparing dates such as these to get proper scores?
    '22/05/2002', '22/06/2002', '22/05/2020', '08/04/2014'

  5. For comparing alphanumeric values, which setType is preferred: NAME, ADDRESS or TEXT?

Thank you very much in advance.


manishobhatia commented on August 16, 2024
  1. The default tokenizer for elements of type TEXT and NAME works on whole words, which requires the words to be separated by spaces. For values like the ones you described, which have no well-defined separators and are just sequences of characters, an N-Gram tokenizer is more appropriate. You can override the default in a TEXT element with the predefined trigram tokenizer: .setTokenizerFunction(TokenizerFunction.triGramTokenizer())

  2. Matching '1234' and '12345' with the NUMBER type compares numerical values and expects them to be close to each other, so 1240 is a better match for 1234 than 12345 is. It seems you want the digits to match as characters rather than as a value. In that case the TEXT type is a better fit, and you can override its tokenizer with the N-Gram tokenizer as above, since these are sequences of characters rather than words.

  3. I apologise for not calling this out in my earlier post: when working with weights, every element defaults to a weight of 1.0. To give a few elements a higher weight, provide a number greater than 1. Leave the least significant element at 1.0 and increase the weights of the others. Try 4.0 on the first, 2.0 on the second and 1.0 on the third; that should probably give the result you were looking for.

  4. Dates work similarly to numerical values: they use a nearest-neighbor match, where dates close to each other match better than ones further apart. You can experiment with the .setNeighborhoodRange parameter when creating an element. It takes values from 0 to 1.0, where a higher value only matches dates that are closer to each other.

  5. For alphanumeric values TEXT is again preferred; the big difference is the choice of tokenizer. If you do not want each word to be compared independently, override the TokenizerFunction.
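
To make points 1 and 2 concrete, here is a small standalone sketch of why trigrams give a partial score where whole-word tokens give none. It only illustrates the idea and is not the code behind TokenizerFunction.triGramTokenizer(); the class name TrigramSketch, the lower-casing and the "shared divided by all unique" scoring are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

final class TrigramSketch {
    // Breaks a value into overlapping 3-character tokens.
    // Values shorter than 3 characters are kept as a single token.
    static Set<String> trigrams(String value) {
        String s = value.toLowerCase();
        Set<String> grams = new HashSet<>();
        if (s.length() < 3) {
            grams.add(s);
            return grams;
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Shared trigrams divided by all unique trigrams across both values.
    static double score(String left, String right) {
        Set<String> common = trigrams(left);
        common.retainAll(trigrams(right));
        Set<String> all = trigrams(left);
        all.addAll(trigrams(right));
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        // "stop" -> {sto, top}; "stops" -> {sto, top, ops}: 2 of 3 trigrams shared
        System.out.println(score("Stop", "Stops"));
        // "1234" -> {123, 234}; "12345" -> {123, 234, 345}: 2 of 3 trigrams shared
        System.out.println(score("1234", "12345"));
    }
}
```

As whole words, 'Stop' and 'Stops' share nothing, which is consistent with the 0 score reported earlier; as trigrams they overlap substantially.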


deepak-Jenkins commented on August 16, 2024

Dear Manish,

I applied triGramTokenizer() and the weights as you suggested, and the results are now as expected :).
For TEXT and NAME comparisons the process is absolutely clear. Thanks a lot for your awesome guidance.

I tried to apply .setNeighborhoodRange to number values as suggested, but the result is always either 1 or 0, not a matching percentage as with TEXT and NAME. I am using the code below.
Document document = new Document.Builder(StringDoc[0])
.addElement(new Element.Builder().setType(ElementType.NUMBER)
.setValue(Integer.valueOf(StringDoc[1]))
.setMatchType(MatchType.NEAREST_NEIGHBORS)
.setNeighborhoodRange(.99D)
.setThreshold(0.00)
.createElement())
.setThreshold(0.00)
.createDocument();

I also tried the setup below, but with no luck.
Document document = new Document.Builder(StringDoc[0])
.addElement(new Element.Builder().setType(ElementType.NUMBER)
.setValue(Integer.valueOf(StringDoc[1]))
.setMatchType(MatchType.EQUALITY)
.setThreshold(0.00)
.createElement())
.setThreshold(0.00)
.createDocument();

I have not tried the DATE type yet. Hopefully, once NUMBER is clear to me, DATE will be similar.

Thank you very much in advance.


deepak-Jenkins commented on August 16, 2024

Dear Manish,

In addition to my previous comment, I also tried adding the value tokenizer as below, but without success. The result is still either 1 or 0, with no percentage in between.

Document document = new Document.Builder(StringDoc[0])
.addElement(new Element.Builder().setType(ElementType.NUMBER)
.setValue(Integer.valueOf(StringDoc[1]))
.setTokenizerFunction(TokenizerFunction.valueTokenizer())
.setMatchType(MatchType.NEAREST_NEIGHBORS)
.setNeighborhoodRange(.99D)
.setThreshold(0.00)
.createElement())
.setThreshold(0.00)
.createDocument();
return document;

I also tried to work with a DATE element. I am setting the value with the code below, but it raises an exception at run time:
com.intuit.fuzzymatcher.exception.MatchException: Unsupported data type.

    Date d = new Date();
    try {
        d = new SimpleDateFormat("dd/MM/yyyy").parse(StringDoc[1]);
    } catch (Exception ex) {
    }
    Document document = new Document.Builder(StringDoc[0])
            .addElement(new Element.Builder().setType(ElementType.DATE)
                    .setValue(d)
                    .setMatchType(MatchType.NEAREST_NEIGHBORS)
                    .setTokenizerFunction(TokenizerFunction.valueTokenizer())
                    .setNeighborhoodRange(.99D)
                    .setThreshold(0.00)
                    .createElement())
            .setThreshold(0.00)        
            .createDocument();

I request your help with NUMBER and DATE type comparison.

Thank you very much.


manishobhatia commented on August 16, 2024

The score for NUMBER and DATE elements is expected to be either 1 or 0. Scoring for other element types depends on tokenization: each element is broken down into smaller tokens, and the percentage of matching tokens is calculated.

Since there is no way to break down numerical and date values, you will always get a score of 1 if they are within the neighbourhood range, and 0 otherwise.
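
The all-or-nothing behaviour can be pictured with the sketch below. The tolerance rule used here (values match when they differ by at most (1 - range) of the reference value) is purely an assumption chosen to show a binary result that tightens as the range grows; it is not taken from the library, and NeighborhoodSketch is a hypothetical name.

```java
final class NeighborhoodSketch {
    // Hypothetical rule, NOT the library's formula: two numbers match when
    // their difference is within (1 - range) of the reference value.
    // A range closer to 1.0 therefore demands closer values.
    static double score(double reference, double candidate, double range) {
        double tolerance = Math.abs(reference) * (1.0 - range);
        return Math.abs(reference - candidate) <= tolerance ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(score(1234, 1240, 0.9));  // inside the tolerance: 1.0
        System.out.println(score(1234, 12345, 0.9)); // far outside: 0.0
    }
}
```

Whatever the exact range rule is, the result is a yes/no decision per element, which is why no intermediate percentage appears.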

The Unsupported data type exception occurs when the value you are passing is not a valid date. I am not able to reproduce it with the code you have. My only suspicion is that the date string does not parse to a valid Date, and since the code silences the exception by catching it, that failure goes unnoticed.
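
One way to surface such a failure instead of hiding it is sketched below, using only the JDK. The class name StrictDateParsing and the choice to rethrow as IllegalArgumentException are assumptions for illustration; the dd/MM/yyyy pattern comes from the code earlier in this thread.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

final class StrictDateParsing {
    // Parses a dd/MM/yyyy string and fails loudly when the input is invalid,
    // rather than swallowing the exception and continuing with a default value.
    static Date parse(String text) {
        SimpleDateFormat format = new SimpleDateFormat("dd/MM/yyyy");
        format.setLenient(false); // reject impossible dates such as 31/02/2002
        try {
            return format.parse(text);
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Not a dd/MM/yyyy date: " + text, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("22/05/2002")); // parses normally
        try {
            parse("2002-05-22");                 // wrong format
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

With a helper like this, a bad input stops document creation at the parsing step, so the problem shows up where it originates.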


deepak-Jenkins commented on August 16, 2024

Dear Manish,

Got it. Thank you for your help and guidance. I implemented the concepts and the results are mostly as expected :).

I observed one case, described below.

Suppose I have a name, address and phone number to match, and I prepare 2 documents with a weight for each element. Assume the name in the 1st document is 'Rohan' and the address in the 2nd document is 'Sarabai Rohan Street'. 'Rohan' is common to both, so the matching score goes up. Ideally, name should be compared only with name, and address only with address.

Any suggestions for this scenario?

Thank you.



manishobhatia commented on August 16, 2024

Closing this issue for now. Feel free to reopen it if you feel it is still unresolved.

