intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.

License: Apache License 2.0

Java 100.00%

fuzzy-matcher's Issues

Full name matching

Quite new here..

I'm looking into using fuzzy-matcher in a sanctions list monitoring service (will share on GitHub once ready), but I'm having trouble figuring out the best way to handle full names.

Example:
A sanctions list provides a name such as "first last", but I need to match it against "first middle1 middle2 last".

What is the best approach to match the latter to the former? In other words, how can I ignore "middle1 middle2" in the query? I can programmatically tokenize the search query and run it for various combinations. Furthermore, in the case of a middle initial (I saw an earlier issue about this that was not resolved), can a weight be given to the initial?

For example "First M Last" should get a higher score compared to "First Last".

Any feedback would be greatly appreciated.
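
The "tokenize the query and run it for various combinations" idea can be sketched in plain Java as below. `NameVariants` is a hypothetical helper, not part of fuzzy-matcher; it assumes the first and last whitespace-separated tokens are the first and last name and treats everything in between as optional middle names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of fuzzy-matcher. */
final class NameVariants {

    /**
     * Returns the full name plus every variant that keeps the first and last
     * tokens and drops some or all of the middle tokens.
     */
    static Set<String> variants(String fullName) {
        List<String> tokens = Arrays.asList(fullName.trim().split("\\s+"));
        Set<String> result = new LinkedHashSet<>();
        if (tokens.size() <= 2) {
            result.add(String.join(" ", tokens));
            return result;
        }
        List<String> middle = tokens.subList(1, tokens.size() - 1);
        int combinations = 1 << middle.size();
        // Each bit of mask decides whether the corresponding middle token is kept.
        for (int mask = combinations - 1; mask >= 0; mask--) {
            List<String> parts = new ArrayList<>();
            parts.add(tokens.get(0));
            for (int i = 0; i < middle.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    parts.add(middle.get(i));
                }
            }
            parts.add(tokens.get(tokens.size() - 1));
            result.add(String.join(" ", parts));
        }
        return result;
    }
}
```

Each variant could then be turned into its own document and matched against the list. The number of variants doubles with every middle name, so this is only practical for names with a handful of tokens.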

Phone number assumed to be a US number

Hi @manishobhatia,
In the pre-processing step for a phone number, the country code '1' gets added to the number, which makes it specific to the US.
I would like to work on a fix for this. Here's what I have in mind: as part of normalization, do not add a country code, but rather remove it if present, so that phone numbers are matched without the country code.
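
A rough sketch of that normalization, written as standalone Java rather than as a patch to the library. `PhoneNormalizer` and its `callingCode` parameter are hypothetical; the sketch only removes the code when the number is written in international form with a leading `+`, since a bare leading digit cannot be told apart from the national number.

```java
/** Hypothetical helper illustrating the proposal; not the library's code. */
final class PhoneNormalizer {

    /**
     * Strips formatting characters and, for numbers written in international
     * form ("+<code>..."), removes the given country calling code.
     */
    static String normalize(String raw, String callingCode) {
        String trimmed = raw.trim();
        // Keep digits only, dropping spaces, dashes, parentheses and the plus sign.
        String digits = trimmed.replaceAll("\\D", "");
        if (trimmed.startsWith("+") && digits.startsWith(callingCode)) {
            return digits.substring(callingCode.length());
        }
        return digits;
    }
}
```

With this, "+1 (732) 478-7395" and "732-478-7395" both normalize to the same digits. A real implementation would need a way to recognize calling codes of varying length without being told which one to expect.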

Threshold matching not changing for ElementType: DATE

I am trying to understand how to change the matching threshold for a set of 3 dates: "07/15/2019", "01/01/2020", and "01/02/2020". The matching score always comes back as 1.0 for all three dates. I have tried changing the threshold at the Element level and at the Document level, and it doesn't make a difference. How can I change the matching so that 07/15/2019 does not match 01/01/2020 with the same score with which 01/01/2020 matches 01/02/2020?
Here is my sample code. It is similar to the JUnit test I found in your project, but modified for a Spring Boot application.

@Component
public class FuzzylogicRunner implements CommandLineRunner {


    @Override
    public void run(String... args) {
        MatchService matchService = new MatchService();

        List<Object> dates = Arrays.asList(getDate("07/15/2019"), getDate("01/01/2020"), getDate("01/02/2020"));
        List<Document> documentList1 = getTestDocuments(dates, DATE, null);
        Map<Document, List<Match<Document>>> result1 = matchService.applyMatch(documentList1);

        result1.forEach((key, value) -> value.forEach(match -> {
            System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
        }));
    }

    private List<Document> getTestDocuments(List<Object> values, ElementType elementType, Double neighborhoodRange) {
        AtomicInteger ai = new AtomicInteger(0);
        return values.stream().map(num -> {
            Element.Builder elementBuilder = new Element.Builder().setType(elementType).setValue(num).setThreshold(0.1);
            if (neighborhoodRange != null) {
                elementBuilder.setNeighborhoodRange(neighborhoodRange);
            }
            return new Document.Builder(Integer.toString(ai.incrementAndGet()))
                    .addElement(elementBuilder.createElement()).setThreshold(0.1)
                    .createDocument();
        }).collect(Collectors.toList());
    }

    private Date getDate(String val) {
        DateFormat df = new SimpleDateFormat("MM/dd/yyyy");
        try {
            return df.parse(val);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return null;
    }
}

The output is always the following (even if I change the threshold from 0.1 to 0.9):
Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0

Date Match with NeighborhoodRange greater than 0.91 fails to give valid results

The DATE element type allows the user to override the NeighborhoodRange, but a value greater than 0.91 causes poor matches to show up.

The following test case in MatchServiceTest.java reproduces the failure:

@Test
    public void itShouldApplyMatchWithDate() {
        List<Object> dates = Arrays.asList(getDate("01/01/2020"), getDate("01/02/2020"), getDate("07/15/2019"));
        List<Document> documentList = getTestDocuments(dates, DATE, 0.91);
        Map<Document, List<Match<Document>>> result = matchService.applyMatch(documentList);
        result.entrySet().forEach(entry -> {
            entry.getValue().forEach(match -> {
                System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
            });
        });

        Assert.assertEquals(2, result.size());
    }

As we increase the value beyond 0.91, dates that are not in the neighborhood show up in the results.

The issue is primarily in the incorrect usage of this

fuzzy-matcher/src/main/java/com/intuit/fuzzymatcher/component/TokenRepo.java

Line 89 in d2ce6f6

private static final double DATE_SCALE_FACTOR = 1.1;

It widens the TokenRange's lower and upper bounds, causing incorrect matches to show up.

Find the most similar array

Hi,

Say I have a list of sentences, Collection<String> sentences, and I simply want to find the one that is most similar to another sentence. Is this possible?

I read the README, and the example there only ingests one input (documents).

Thanks !
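
To make the question concrete, here is a plain-Java sketch of "pick the most similar sentence". It uses word-level Jaccard similarity, which is a stand-in technique and not fuzzy-matcher's algorithm; `MostSimilar` and its methods are hypothetical names. Inside the library, the two-list `applyMatch(documents, matchWith)` call that appears in another issue on this page looks like the closer fit, though that is an assumption.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; uses Jaccard similarity, not fuzzy-matcher's scoring. */
final class MostSimilar {

    /** Lower-cased, whitespace-separated words of a sentence. */
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().trim().split("\\s+")));
    }

    /** Size of the word intersection divided by the size of the word union. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    /** Returns the candidate with the highest similarity to the query, if any. */
    static Optional<String> mostSimilar(String query, Collection<String> sentences) {
        return sentences.stream()
                .max(Comparator.comparingDouble(s -> jaccard(query, s)));
    }
}
```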

Why Does Matching Fail in These Scenarios?

Hello,

I tried to modify some names for matching [Ahmed, Mohamed, Jouliana] by making slight alterations and then attempted to find matches. I anticipated matches like (1,2), (3,4), (3,5), (4,5), and (6,7).

Here's the code snippet I used:

String[][] input = {
    {"1", "Ahmed"},
    {"2", "Ahme"},
    {"3", "Mohamed"},
    {"4", "Mohame"},
    {"5", "Mahamad"},
    {"6", "Jouliana"},
    {"7", "Jouli"}
};

List<Document> documentList = Arrays.asList(input).stream().map(contact -> {
    return new Document.Builder(contact[0])
            .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME).createElement())
            // .addElement(new Element.Builder<String>().setValue(contact[1]).setType(TEXT).createElement())
            .createDocument();
}).collect(Collectors.toList());

However, I only received the following result when the type is set to NAME:

Data: {[{'Mohamed'}]} Matched With: {[{'Mahamad'}]} Score: 1.0
Data: {[{'Mahamad'}]} Matched With: {[{'Mohamed'}]} Score: 1.0

And no results were obtained when the type is set to TEXT.

Some issues with the matching

Hi Manish,

This is an extremely neat tool you have developed, kudos!! I have been playing around with it and have run into a couple of issues. I was hoping you would help me resolve them.

I have been trying to match a single record against a database of records, and have been scaling the database up from 10000 to 500000 records. This is how I have been configuring the database:

new Document.Builder(csv[0])
.addElement(new Element.Builder().setType(NAME).setValue(getName(csv)).createElement())
.addElement(new Element.Builder().setType(ADDRESS).setValue(getAddress(csv)).createElement())
.addElement(new Element.Builder().setType(PHONE).setValue(csv[8]).setWeight(2)
.setThreshold(0.5).createElement())
.addElement(new Element.Builder().setType(EMAIL).setValue(csv[10]).createElement())
.setThreshold(0.5)
.createDocument();

But I am seeing some anomalies.

I am always seeing one record being returned, whereas I am expecting all records with a score greater than the 0.5 threshold. And I know that in each case, there are multiple records that should pass the threshold. This is how I am printing the records:
result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Person searched: " + match.getData() + "\nMatched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});

I don't have a unique identifier for each record. This is what my CSV looks like:
"first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
Do you think that might be the issue?

I am seeing something like this:
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Desiga'}, {'4 W Broad St San Juan Capistrano Orange CA 92675'}, {'[email protected]'}, {'949-622-6261'}]} Score: 0.8668599263800767

Given the only thing remotely in common is the first name, I am wondering why there is such a high matching score. Whereas something like:
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'[email protected]'}, {'732-478-7394'}]} Score: 0.7142857142857143
which is actually a better match is only getting a score of 0.7142.
Other "bad" match, but good score examples:
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Fedak'}, {'4983 Mcallister St Cambridge Middlesex MA 02138'}, {'[email protected]'}, {'617-357-4376'}]} Score: 0.7142857142857143
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Molavi'}, {'17389 Market St #8 Pearl City Honolulu HI 96782'}, {'[email protected]'}, {'808-723-3110'}]} Score: 0.7142857142857143

And only these "bad" matches were returned despite the "good" match being present in the database:
{[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'[email protected]'}, {'732-478-7394'}]} Score: 0.7142857142857143

Any suggestions on how to tune the match database to get better results than this? Would be greatly appreciated!

FYI, the data I am using here is all fake.

Thanks!
Vishal.

Element constructor does not initialize the scoringFunction with the value passed to the constructor signature

Line 63 of the Element class assigns this.scoringFunction to itself when the scoringFunction passed to the constructor is not null.

fuzzy-matcher/src/main/java/com/intuit/fuzzymatcher/domain/Element.java

Line 63 in 81d5113

    
           this.scoringFunction = scoringFunction != null ? this.scoringFunction : DEFAULT_ELEMENT_SCORING;
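
The presumed intent is to use the constructor parameter on the left side of the ternary. The standalone class below reproduces the pattern with the corrected assignment; `ScoringHolder`, `DEFAULT_SCORING` and the `Function<Integer, Double>` type are hypothetical stand-ins and are not the library's actual declarations.

```java
import java.util.function.Function;

/** Hypothetical stand-in for the Element constructor logic. */
final class ScoringHolder {
    static final Function<Integer, Double> DEFAULT_SCORING = n -> 0.0;

    private final Function<Integer, Double> scoringFunction;

    ScoringHolder(Function<Integer, Double> scoringFunction) {
        // Read the constructor parameter; the field itself is still unset at this point.
        this.scoringFunction = scoringFunction != null ? scoringFunction : DEFAULT_SCORING;
    }

    Function<Integer, Double> getScoringFunction() {
        return scoringFunction;
    }
}
```

With the original expression, a non-null argument would leave the field holding its own initial value (null) instead of the function that was passed in.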

Trying to match names with initials

Hi,
I am trying to match names such as A Mathur with Abhishek Mathur, or Donald Trump with D Trump. Is there some simple parameter I can adjust to allow that?

Thanks
Abhi
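
Outside the library, the rule being asked for can be sketched as a token comparison in which a single-letter token matches any token that starts with that letter. `InitialMatcher` is a hypothetical helper; it assumes both names have the same number of tokens in the same order and does not strip punctuation such as "A.".

```java
/** Hypothetical helper, not a fuzzy-matcher API. */
final class InitialMatcher {

    /** True when two name tokens are equal, or one is a single-letter initial of the other. */
    static boolean tokenMatches(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        if (x.equals(y)) {
            return true;
        }
        if (x.length() == 1) {
            return y.startsWith(x);
        }
        if (y.length() == 1) {
            return x.startsWith(y);
        }
        return false;
    }

    /** Compares names token by token; both names must have the same number of tokens. */
    static boolean namesMatch(String left, String right) {
        String[] l = left.trim().split("\\s+");
        String[] r = right.trim().split("\\s+");
        if (l.length != r.length) {
            return false;
        }
        for (int i = 0; i < l.length; i++) {
            if (!tokenMatches(l[i], r[i])) {
                return false;
            }
        }
        return true;
    }
}
```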

Help

Sorry to ask, but I want to learn how to build libraries like this, and I don't understand how to test your project. Really sorry to bother you.

Matching two strings

Hi,

I came across this library and tried a very elementary test: matching two strings to find the score.
After going through the README, several null pointer exceptions, and the issues already raised, I was finally able to get something working by doing this:

List<String> input = Arrays.asList(firstStr, secondStr);

MatchService matchService = new MatchService();
List<Document> documentList = new ArrayList<>();
for (String str : input) {
    // Each string becomes a single-element document, keyed by the string itself
    Document document = new Document.Builder(str)
            .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.TEXT).createElement())
            .createDocument();
    documentList.add(document);
}
Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

// Print every match together with its score
result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: "
                + match.getScore().getResult());
    });
});

However it is found that:

The strings match each other twice, i.e. firstStr matches secondStr and then vice versa, while only a single match was required.
I tried a string with numbers, like "Nayan J Bayan 123", with element type TEXT, and the result came out as an empty map. What should the element type be if I want to compare strings containing numbers and special characters? I also tried the ADDRESS type, and the same empty map was returned.
I don't think my implementation is correct at all. What would the correct implementation be?

Thanks

Support for configuring abbreviated names

We have a situation where common abbreviations of names do not match the proper name, and we are not sure whether that is supported.

For example:

'Barney' - an abbreviation of 'Barnaby' will not match with 'Barnaby'.

If this is not currently supported, I was wondering whether it's worth supporting by allowing an app to supply a map of abbreviations to known names. Then if, say, all other matching fails and any 'NAME' type components of the match document contain abbreviations from the map, the proper name could be substituted and the match reattempted.
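
Until something like this exists in the library, an application could apply the substitution itself before building documents. The sketch below shows the idea; `NicknameNormalizer` is a hypothetical class and the map contents are supplied by the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pre-processing step; not part of fuzzy-matcher. */
final class NicknameNormalizer {
    private final Map<String, String> nicknameToProper = new HashMap<>();

    NicknameNormalizer(Map<String, String> entries) {
        // Store keys in lower case so lookups are case-insensitive.
        entries.forEach((nick, proper) -> nicknameToProper.put(nick.toLowerCase(), proper));
    }

    /** Replaces every token that is a known abbreviation with its proper name. */
    String normalize(String name) {
        StringBuilder out = new StringBuilder();
        for (String token : name.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(nicknameToProper.getOrDefault(token.toLowerCase(), token));
        }
        return out.toString();
    }
}
```

Normalizing both sides up front is simpler than the "retry after failure" flow described above, at the cost of also rewriting names where the short form is the person's actual name.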

Domains: how to implement a new/different domain

Hi,
excellent library and I'd love to apply the functionality to other domains. The current code uses address details. As far as I can tell the Element classes would have to be rewritten to accommodate another domain.
Am I missing something?
cheers,
-gabe

Cross-Language Fuzzy Matching: Arabic Document Matching returns 0 matches

Hello,

Does the algorithm support fuzzy matching between two non-English strings? I'm trying to match two Arabic records, but it returns 0 matches even with identical documents. Is this feasible, or are there any workarounds available?

Thank you!

Is there any way to create my own matchers?

Hi @manishobhatia,

I couldn't find a way to create my own matcher trying to solve the following problem:

I have document numbers and I would like to compare them with others by the digits they share. For instance, document A = 12305211C LPZ, document B = 12105321C CBA.
The matching score is 1.0.

Explanation:

Input A:
After preprocessing: 12305211 (deleting non-digit chars)
After tokenization: int[] occurrencesA, for instance occurrencesA[1] is 3

Input B:
After preprocessing: 12105321 (deleting non-digit chars)
After tokenization: int[] occurrencesB, for instance occurrencesB[1] is 3

Matching Function A with B: (number of occurrences) / (length of A).

I would appreciate it if you have any workaround or approach to solve the problem with the library.

Thank you very much,
I wish you the best
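
For reference, the scoring described above can be written as standalone Java. This is one reading of the description, not a fuzzy-matcher extension point: "number of occurrences" is taken to mean, for each digit 0-9, the smaller of its counts in A and B, summed over all digits. `DigitOverlap` is a hypothetical name.

```java
/** Hypothetical scorer following the description in this issue. */
final class DigitOverlap {

    /** Counts how often each digit 0-9 occurs; all other characters are ignored. */
    static int[] occurrences(String raw) {
        int[] counts = new int[10];
        for (char c : raw.toCharArray()) {
            if (c >= '0' && c <= '9') {
                counts[c - '0']++;
            }
        }
        return counts;
    }

    /** Shared digit occurrences divided by the number of digits in a. */
    static double score(String a, String b) {
        int[] occA = occurrences(a);
        int[] occB = occurrences(b);
        int shared = 0;
        int lengthA = 0;
        for (int d = 0; d < 10; d++) {
            shared += Math.min(occA[d], occB[d]);
            lengthA += occA[d];
        }
        return lengthA == 0 ? 0.0 : (double) shared / lengthA;
    }
}
```

For the two example documents, both contain the same eight digits in a different order, so the shared count is 8 and the score is 8 / 8 = 1.0, which agrees with the figure stated in the issue.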

Though there is matching result but matcher is not returning.

Hi Manish,
I am trying to make use of fuzzy-matcher, for which I have written a sample test class (the code is appended below).
I am surprised to find that the system is not returning the matching record. I am not able to figure out what I have done wrong.

Could you please help?
Amarjeet

// Java code start here
package test;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.intuit.fuzzymatcher.component.MatchService;
import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.ElementType;
import com.intuit.fuzzymatcher.domain.Match;

public class SampleTest {

public static void main(String[] args) {
	MatchService matchService = new MatchService();
	SampleTest test = new SampleTest();
	Map<Document, List<Match<Document>>> map;
    
	
	List<Document> matchDocument = test.loadList();
	
	Customer searchCust = new Customer();
	searchCust.setId(1);
	searchCust.setfName("Amarjeet");
	searchCust.setmName("Rampher");
	searchCust.setlName("Tiwari");
	searchCust.setAadhar("123456789012");
	//searchCust.setAddress("25 avenue");
	//searchCust.setPan("asdf345ghb");
	Document searchDocument = new Document.Builder(""+searchCust.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getAadhar()).createElement())
            //.addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getPan()).createElement())
            .createDocument();
	
	List<Document> documentList = new ArrayList<>();
    documentList.add(searchDocument);
    
    map = matchService.applyMatch(matchDocument, documentList);
    System.out.println("Match : " + map);
	Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);	 
	System.out.println("Match : " + result);
	
	
}


public class MatchServiceFactory {
	
    private MatchService matchService = new MatchService();

    public MatchService getMatchService() {
        return this.matchService;
   }
}

//Load the list 
public List<Document> loadList() {
	Customer cust = new Customer();
	Customer cust1 = new Customer();
	Customer cust2 = new Customer();
	Customer cust3 = new Customer();
	Customer cust4 = new Customer();
	Customer cust5 = new Customer();
	Customer cust6 = new Customer();
	Customer cust7 = new Customer();
	Customer cust8 = new Customer();
	Customer cust9 = new Customer();
	Customer cust10 = new Customer();
	cust.setId(1);
	cust.setfName("Amarjeet");
	cust.setmName("Rampher");
	cust.setlName("Tiwari");
	cust.setAadhar("123456789012");
	cust.setAddress("137 hopkins ave jersey city");
	cust.setPan("adopt8451z");
	
	cust1.setId(2);
	cust1.setfName("fname2");
	cust1.setmName("mname2");
	cust1.setlName("lname3");
	cust1.setAadhar("12233434");
	cust1.setAddress("25 avenue2");
	cust1.setPan("asdf345ghb2");

	cust2.setId(3);
	cust2.setfName("fname3");
	cust2.setmName("mname3");
	cust2.setlName("lname3");
	cust2.setAadhar("122334343");
	cust2.setAddress("25 avenue3");
	cust2.setPan("asdf345ghb3");	
	
	Document searchDocument = new Document.Builder(""+cust.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getPan()).createElement())
            .createDocument();

	Document searchDocument1 = new Document.Builder(""+cust1.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getPan()).createElement())

            .createDocument();

	Document searchDocument2 = new Document.Builder(""+cust2.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getPan()).createElement())
            .createDocument();
	    
	List<Document> documentList = new ArrayList<>();
    documentList.add(searchDocument);
    documentList.add(searchDocument1);
    documentList.add(searchDocument2);
    return documentList;
}

}

class Customer{
	int id;
	String fName;
	String mName;
	String lName;
	String address;
	String pan;
	String aadhar;
	
	public int getId() {
		return id;
	}
	public void setId(int id) {
		this.id = id;
	}
	public String getfName() {
		return fName;
	}
	public void setfName(String fName) {
		this.fName = fName;
	}
	public String getmName() {
		return mName;
	}
	public void setmName(String mName) {
		this.mName = mName;
	}
	public String getlName() {
		return lName;
	}
	public void setlName(String lName) {
		this.lName = lName;
	}
	public String getAddress() {
		return address;
	}
	public void setAddress(String address) {
		this.address = address;
	}
	public String getPan() {
		return pan;
	}
	public void setPan(String pan) {
		this.pan = pan;
	}
	public String getAadhar() {
		return aadhar;
	}
	public void setAadhar(String aadhar) {
		this.aadhar = aadhar;
	}
}

// Java code ends here

How to use getScore in Element class? what is the matchingCount?

Hello 👋,

I've had a look at the code snippet provided in the Element class:

public double getScore(Integer matchingCount, Element other) {
    return (double)matchingCount / (double)this.getChildCount(other);
}

public long getChildCount(Matchable other) {
    if (other instanceof Element) {
        Element<T> o = (Element)other;
        return (long)Math.max(this.getTokens().size(), o.getTokens().size());
    } else {
        return 0L;
    }
}

I'm doing matching between two Documents (Rows) and aiming to obtain the score for each element in the match. I presume each document contains only one document match, which has the highest matching score (row-wise).

I primarily encounter two issues:

What does matchingCount represent? I would appreciate any examples.
How can I map between elements in document 1 with their corresponding elements in document 2? I notice that elements don't have a key attribute or something similar that could help in establishing the mapping.

Thank you.
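
Going only by the snippet quoted above, the arithmetic works out as in this small sketch. It assumes `matchingCount` is the number of tokens of one element that found a match in the other, which the snippet does not confirm; `ElementScoreExample` and its parameter names are hypothetical.

```java
/** Hypothetical re-statement of the quoted formula with plain ints. */
final class ElementScoreExample {

    /**
     * Mirrors getScore: matchingCount divided by the larger of the two token counts.
     * The caller must pass at least one non-zero token count.
     */
    static double score(int matchingCount, int tokensInThis, int tokensInOther) {
        int childCount = Math.max(tokensInThis, tokensInOther);
        return (double) matchingCount / childCount;
    }
}
```

For example, if one element has 3 tokens, the other has 4, and 2 tokens matched, the result is 2 / 4 = 0.5. Dividing by the larger token count means extra unmatched tokens on either side pull the score down.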

Error on tests

Hi !

I am getting this error when trying to install

`Tests run: 81, Failures: 0, Errors: 8, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.490 s
[INFO] Finished at: 2019-09-03T17:25:48-03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project fuzzy-matcher: There are test failures.
[ERROR]
[ERROR] Please refer to c:\Users\Luccas Klotz\Downloads\fuzzy-matcher-master\target\surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException`

If I go to the \fuzzy-matcher-master\target\surefire-reports, errors are in the tests where files such as demo.csv and test-data.csv could not be found, but the files are there.

Any clues?

Thanks !

Aggregate strings based on similarity into groups.

Is there a way to aggregate strings into groups based on similarity? I have a list of strings, List<String> strings, and I need to group them.

MatchService provides applyMatchByGroups, which is very close, but the inner Set contains every match for the group instead of only the unique values, which is what I need.

Thanks in advance for the answer.

New Element Type for product names

I've been trying to compare model names of different machines.
I played around with the configuration of my elements but I'm not getting good results.

I would like to get a high match score if one string contains a similar, smaller string:

STP375S-B60/Wnh_1500V_20V02_1756
STP 375S-B60/Wnh

If there is a configuration that would result in my desired score that would be the best but if not then I suggest having a new Element type "Product Designation" that could compare strings like "M80 PDF C" sensibly.

IllegalStateException: stream has already been operated upon or closed

I got the following exception when attempting to match a Document.

Caused by: java.lang.IllegalStateException: stream has already been operated upon or closed
	at java.util.stream.AbstractPipeline.<init>(AbstractPipeline.java:203) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.<init>(ReferencePipeline.java:94) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$StatelessOp.<init>(ReferencePipeline.java:618) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2.<init>(ReferencePipeline.java:163) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.filter(ReferencePipeline.java:162) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.domain.Token.getSearchGroups(Token.java:55) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$null$12(MatchOptimizerFunction.java:177) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_181]
	at java.util.stream.DistinctOps$1$2.accept(DistinctOps.java:175) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_181]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$initializeSearchGroups$13(MatchOptimizerFunction.java:171) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_181]
	at java.util.stream.DistinctOps$1$2.accept(DistinctOps.java:175) ~[na:1.8.0_181]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1625) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.initializeSearchGroups(MatchOptimizerFunction.java:166) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$searchGroupOptimizer$8(MatchOptimizerFunction.java:148) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.component.TokenMatch.lambda$matchTokens$1(TokenMatch.java:26) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[na:1.8.0_181]
	at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1696) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747) ~[na:1.8.0_181]
	at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721) ~[na:1.8.0_181]
	at java.util.stream.AbstractTask.compute(AbstractTask.java:316) ~[na:1.8.0_181]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool$WorkQueue.execLocalTasks(ForkJoinPool.java:1040) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1058) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[na:1.8.0_181]
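
For context, this exception is what the JDK throws whenever a second operation is attached to a Stream instance that has already been used. The sketch below reproduces that general behavior and shows the usual remedy of creating a fresh stream per use. It is not the library's code, and whether `Token.getSearchGroups` holds on to a Stream in this way is an assumption based only on the stack trace.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

/** Standalone illustration of stream reuse; names are hypothetical. */
final class StreamReuseDemo {

    /** Uses the same Stream twice; the second filter() call throws. */
    static boolean reuseThrows() {
        Stream<String> tokens = Stream.of("a", "b", "c");
        tokens.filter(t -> !t.isEmpty()).count();
        try {
            tokens.filter(t -> !t.isEmpty()).count();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** A Supplier hands out a fresh Stream on every call, so repeated use is safe. */
    static long countTwice(List<String> source) {
        Supplier<Stream<String>> tokens = source::stream;
        long first = tokens.get().filter(t -> !t.isEmpty()).count();
        long second = tokens.get().filter(t -> !t.isEmpty()).count();
        return first + second;
    }
}
```

Storing the underlying collection (or a Supplier, as here) instead of the Stream itself avoids the problem, including when several threads read the same object.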

Information on Library usage

Hi,
Can I use this library for matching payment and invoice records?
Let's say on the payment side I have 5-6 properties: { invoice-number, invoice-amount, customer-name, invoice-date, customer-email }
Note: These properties are extracted for payments from the advice document we receive.
The same properties will be available for invoices as well: { invoice-number, invoice-amount, customer-name, invoice-date, customer-email }

Can I use this library to find which invoice is the most suitable match for a payment?

For example, let's consider there is one payment and 6 invoices.

Will I be able to score all the invoices against the payment, like this?

payment 1 - invoice 1 - score is 0.9
payment 1 - invoice 2 - score is 0.8
payment 1 - invoice 3 - score is 0.7
payment 1 - invoice 4 - score is 0.6
payment 1 - invoice 5 - score is 0.65
payment 1 - invoice 6 - score is 0.77

Inconsistency in matching results

Hi there!

I am running some matching tests with your library on a large dataset of bank transactions. The documents used for matching have a single TEXT field, containing the description of the transaction. On running the matching, I found that some strings that should be an exact match, come back with a probability as low as 0.5. See the below sample:

{[{'Immediate Payment Fee,'}]}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 1.0}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}

The top line is the document that is matched and the other lines are the matching results. The implementation of my test is as straightforward as can be, I believe:

            // Build document input
            List<Document> documents = new ArrayList<>();
            transactions.stream().forEach(t -> {

                documents.add(new Document.Builder(String.valueOf(index.incrementAndGet()))
                    .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(t.getDescription()).createElement())
                    .createDocument());
            });

            MatchService matchService = new MatchService();
            Map<Document, List<Match<Document>>> matched = matchService.applyMatch(documents);

            for (Iterator<Document> i = matched.keySet().iterator(); i.hasNext(); ) {

                Document d = i.next();
                System.out.println("");
                System.out.println(d.toString());
                matched.get(d).stream().forEach(doc -> {

                    System.out.println("Matching " + doc.toString());
                });
            }

I'd love to use your library for my project and hope to contribute, but I think I need to first understand why not all matches are being returned as 1.0 as all the strings are identical.

Thanks!

Support for distinct groups of similar strings

Here is the scenario. For input like

String[][] input = {
    {"1", "Nike"},
    {"2", "Puma"},
    {"3", "Niket dhaka"},
    {"4", "Levi's"},
    {"5", "Levi's Fashion for Men and Women"},
    {"6", "Nike"},
    {"7", "Puma Sports and Fitness"},
    {"8", "Puma Shoes"},
    {"9", "Fashion Nova"},
    {"10", "H&M Fashion"},
    {"11", "Nike Sports"}
};

the groups formed are something like this:

[
{
"0": {
"id": 11,
"name": "Nike Sports"
},
"1": {
"id": 1,
"name": "Nike"
},
"2": {
"id": 6,
"name": "Nike"
}
},
{
"0": {
"id": 1,
"name": "Nike"
},
"1": {
"id": 6,
"name": "Nike"
},
"2": {
"id": 11,
"name": "Nike Sports"
}
},
{
"0": {
"id": 2,
"name": "Puma"
},
"1": {
"id": 8,
"name": "Puma Shoes"
}
},
{
"0": {
"id": 6,
"name": "Nike"
},
"1": {
"id": 1,
"name": "Nike"
},
"2": {
"id": 11,
"name": "Nike Sports"
}
},
{
"0": {
"id": 8,
"name": "Puma Shoes"
},
"1": {
"id": 2,
"name": "Puma"
}
},
{
"0": {
"id": 9,
"name": "Fashion Nova"
},
"1": {
"id": 10,
"name": "HM Fashion"
}
},
{
"0": {
"id": 10,
"name": "HM Fashion"
},
"1": {
"id": 9,
"name": "Fashion Nova"
}
}
]

You can see that groups with IDs 11,1,6 and 1,6,11 are both formed. Is there any way to get only distinct groups?
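One way to collapse the repeated groups after the fact is to key each group by the set of its document IDs and keep the first occurrence. The sketch below is a post-processing step over plain lists of IDs, not a feature of the library; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of fuzzy-matcher.
class DistinctGroups {

    // Keeps the first group seen for every distinct set of IDs,
    // so [11, 1, 6] and [1, 6, 11] collapse into a single group.
    static List<List<String>> distinct(List<List<String>> groups) {
        Map<Set<String>, List<String>> byMembers = new LinkedHashMap<>();
        for (List<String> group : groups) {
            // A set used as the key ignores the order in which members were listed.
            byMembers.putIfAbsent(new TreeSet<>(group), group);
        }
        return new ArrayList<>(byMembers.values());
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("11", "1", "6"),
                Arrays.asList("1", "6", "11"),
                Arrays.asList("2", "8"),
                Arrays.asList("8", "2"));
        System.out.println(distinct(groups)); // prints [[11, 1, 6], [2, 8]]
    }
}
```

The same idea applies to groups of Document objects if each group is first mapped to the IDs of its members.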

Any way to add name dictionary for normalization characters

I noticed that we have a name dictionary.txt. Is there any way I can override this dictionary or add some mapping config to it?

Upgrade to Java 11

Java 8 is nearing EOL, so we are looking to upgrade this project to support Java 11.

Identify dependent libraries support for java 11 and upgrade them if necessary
Unit test should pass and no significant performance impact should be seen

Name List matcher

My use case is to compare individuals from an ancestry application. I need the ability to check on a list of sibling names and children names.

I am currently using the library to check names, gender, birthdates and such. These are all single values.

Now imagine I have a random number of siblings. I would like to call it a match if at least one sibling name matches between individual A and B. I also would like to do the same with a list of child names.

I have not thought of a way to use the existing functionality to do this. Perhaps it can be done. I have only been using the library for a week. If so, please tell me how. If not, I am asking that the library be extended to support a list of names as a type.
I could imagine it being used like this:

.addElement(new Element.Builder<List>()
.setValue(individual.getSiblingNames())
.setVariance("Siblings")
.setType(ElementType.NAME_LIST)
.setWeight(0.3)
.createElement())

Maybe a setting for how many name matches are required for it to fuzzy match?
I would be satisfied with just one, but it could be more generic.
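Until something like this exists in the library, the "at least N names in common" rule can be sketched outside it. The helper below is hypothetical (there is no NAME_LIST type in the text above beyond the proposal) and compares names exactly after trimming and lower-casing, so it illustrates the counting rule only, not fuzzy name matching.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of fuzzy-matcher.
class NameListOverlap {

    // Trims and lower-cases each name so "Anna " and "anna" compare equal.
    static Set<String> normalize(List<String> names) {
        Set<String> out = new HashSet<>();
        for (String name : names) {
            out.add(name.trim().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // True when at least minMatches names appear in both lists.
    static boolean matches(List<String> a, List<String> b, int minMatches) {
        Set<String> shared = normalize(a);
        shared.retainAll(normalize(b));
        return shared.size() >= minMatches;
    }

    public static void main(String[] args) {
        List<String> siblingsA = Arrays.asList("Anna", "Bob");
        List<String> siblingsB = Arrays.asList("anna", "Carl");
        System.out.println(matches(siblingsA, siblingsB, 1)); // prints true
        System.out.println(matches(siblingsA, siblingsB, 2)); // prints false
    }
}
```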

Questions

Thanks for creating this library! I have not looked into the code in detail, but would like to ask a few questions.

It looks like the "soundex" functionality from Apache is heavily biased towards the English language only?
Do you have any estimates of the performance/complexity of the algorithms involved? I am considering this library for searching/iterating through in-memory data objects that have no other indexing.

Kotlin not supported

Please provide Kotlin language support. In Kotlin, the result object is always empty.

upgrade commons-text to a non-vulnerable version

A new arbitrary code execution vulnerability has been found in commons-text:

org.apache.commons:commons-text:1.9

https://security.snyk.io/package/maven/org.apache.commons:commons-text/1.9

Version 1.10.0 is not vulnerable.

Support for Age Element type

On similar lines to the NUMBER element, can we have an AGE element?
The expectation here is that a user's age can change by a small amount if captured at different times.

So we should see a higher score if the age values are closer (around 1 or 2 years' difference), getting progressively lower, with a score of 0 for an age difference over 5 years.
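The scoring described above could look roughly like the following. This is a sketch of one possible decay curve, not the library's NUMBER scoring: it assumes a linear drop-off and a cut-off of 5 years (so a difference of exactly 5 already scores 0), and the class and constant names are made up for illustration.

```java
// Hypothetical scoring helper, not part of fuzzy-matcher.
class AgeScore {

    // Assumed cut-off: differences of this many years or more score 0.
    static final int MAX_DIFF_YEARS = 5;

    // Returns 1.0 for identical ages and decreases linearly with the difference.
    static double score(int ageA, int ageB) {
        int diff = Math.abs(ageA - ageB);
        return Math.max(0.0, 1.0 - (double) diff / MAX_DIFF_YEARS);
    }

    public static void main(String[] args) {
        System.out.println(score(30, 30)); // identical ages
        System.out.println(score(30, 31)); // one year apart
        System.out.println(score(30, 36)); // more than five years apart
    }
}
```

A steeper or non-linear curve would fit the same interface; the linear form is only the simplest choice that satisfies the three conditions in the request.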

Combine Tokenizers for better results

Hi, I had problems using a single tokenizer for matching names.
The wordSoundexEncodeTokenizer was matching two different names as equal: I was matching "Caputo" and the MatchService returned "Caputo" and "Chabot" with equal scores.
The wordTokenizer was skipping "Nikolau" when the correct match was "Nikolaou".
The triGramTokenizer was skipping "Leao" when there was a direct match with "Rafael Leao".

I found a temporary solution concatenating the Tokenizers with a custom method:

@SafeVarargs
    public static <T> Function<Element<T>, Stream<Token<T>>> concatTokenizers(Function<Element<T>, Stream<Token<T>>>... funct) {
        return element -> Arrays.stream(funct).flatMap(fun -> fun.apply(element));
    }

and using it like

                .setTokenizerFunction(concatTokenizers(
                        TokenizerFunction.wordTokenizer(),
                        TokenizerFunction.wordSoundexEncodeTokenizer(),
                        TokenizerFunction.triGramTokenizer()
                ))

I'm not sure if this is the correct approach, but I hope that the function is helpful to others with the same problem.
If the solution is correct I would like to have it inserted in the library.

The results after using the function were as expected and all items matched perfectly. As a further development, I would suggest, if possible, either giving a weight to each tokenizer or listing the tokenizers in order, so that when one gives no results the next one is used. This would prioritize exact matches over like-sounding ones that may otherwise have the same score.
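The "try the next one when the previous gives no results" idea can be sketched independently of the library as a chain of lookup functions. Everything here is hypothetical: the functions stand in for whole matching passes (for example one run per tokenizer), and the "same initial" matcher is only a crude placeholder for a phonetic pass.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of fuzzy-matcher.
class FallbackMatcher {

    // Applies the matchers in order and returns the first non-empty result.
    @SafeVarargs
    static <Q, R> Function<Q, List<R>> firstNonEmpty(Function<Q, List<R>>... matchers) {
        return query -> {
            for (Function<Q, List<R>> matcher : matchers) {
                List<R> result = matcher.apply(query);
                if (!result.isEmpty()) {
                    return result;
                }
            }
            return Collections.emptyList();
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Caputo", "Chabot");
        // Strict pass: exact, case-insensitive equality.
        Function<String, List<String>> exact = q -> names.stream()
                .filter(n -> n.equalsIgnoreCase(q)).collect(Collectors.toList());
        // Loose pass: same first letter (placeholder for a phonetic matcher).
        Function<String, List<String>> sameInitial = q -> names.stream()
                .filter(n -> n.charAt(0) == q.charAt(0)).collect(Collectors.toList());

        Function<String, List<String>> chain = firstNonEmpty(exact, sameInitial);
        System.out.println(chain.apply("Caputo")); // prints [Caputo]
        System.out.println(chain.apply("Cabot"));  // prints [Caputo, Chabot]
    }
}
```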

Comparing two strings of different lengths

I have two strings to compare, but one is too long, so I get no match:
{"jojo","le-bizzarre-avventure-di-jojo-diamond-is-unbreakable"}

I'm using this matching call:
matchService.applyMatchByGroups(documentList)
with this type of document:

    for (String str : input) {
        document2 = new Document.Builder(str)
                .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.NAME).createElement())
                .setThreshold(0)
                .createDocument();
        documentList.add(document2);
    }
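For cases like this, where a short string should match a much longer one, a containment-style score may be closer to what is wanted than a symmetric similarity. The sketch below is not how the library scores NAME elements; it is a standalone illustration, with hypothetical names, that measures what fraction of the shorter string's words appear in the longer one.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of fuzzy-matcher.
class TokenContainment {

    // Splits on anything that is not a letter or digit, so hyphens act as separators.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Fraction of the smaller token set that is also present in the larger one.
    static double score(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> small = ta.size() <= tb.size() ? ta : tb;
        Set<String> large = small == ta ? tb : ta;
        if (small.isEmpty()) {
            return 0.0;
        }
        long hits = small.stream().filter(large::contains).count();
        return (double) hits / small.size();
    }

    public static void main(String[] args) {
        System.out.println(score("jojo", "le-bizzarre-avventure-di-jojo-diamond-is-unbreakable")); // prints 1.0
        System.out.println(score("jojo", "naruto")); // prints 0.0
    }
}
```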

SLF4J Failed to load

Your software looks fantastic. We will be using it for managing our existing sales leads database and verifying uniqueness of future entries.
The program worked with the sample data but generated a warning regarding the logger. Is there a fix for this? I plan to utilize this application a lot. Thank you for your efforts!

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Data: {[{'Steven Wilson'}, {'45th Avenue 5th st.'}]} Matched With: {[{'Stephen Wilkson'}, {'45th Ave 5th Street'}]} Score: 1.0000000000000002
Data: {[{'Stephen Wilkson'}, {'45th Ave 5th Street'}]} Matched With: {[{'Steven Wilson'}, {'45th Avenue 5th st.'}]} Score: 1.0000000000000002

Matching On Single Word

Hello, using the documented example, if I search on "Stephen" instead of "Stephen Wilkson", I get no matches returned. Is there a way to search so that a single word is matched to elements that contain multiple words?

Need a mechanism to match elements by a matching key rather than by element type

Let us suppose we have documents with several elements of the same element type, as follows:

{
   name: NameElementType,
   spouseName: NameElementType
}

Need a mechanism to match these elements independently based on their key rather than element type. In the above example, we want to match all documents based on name and spouseName independently.

Address matching: street containing hyphens

Currently these addresses are not matched at all:
Case 1:

Egger-Lienz-Strasse, 10 AT-4050 Traun
Egger Lienz Strasse, 10 AT-4050 Traun

Case 2:

Erdl 5a AT-4861 Schörfling am Attersee
Erdl 5b AT-4861 Schörfling/Attersee

Is there any option to improve the algorithm so that space and hyphen are treated as equal separators?

Regards,
Andrei.
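A normalization step along these lines would make the two spellings in Case 1 identical before tokenization. This is a standalone sketch with a hypothetical class name; whether and how it can be plugged in as a custom pre-processing function depends on the library version in use. Note that it also splits "AT-4050" into two tokens, and that it does not resolve Case 2, where "5a"/"5b" and the extra word "am" still differ.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper, not part of fuzzy-matcher.
class SeparatorNormalizer {

    // Treats hyphens and slashes as spaces, collapses whitespace, lower-cases.
    static final Function<String, String> NORMALIZE = s -> s
            .replaceAll("[-/]", " ")
            .replaceAll("\\s+", " ")
            .trim()
            .toLowerCase(Locale.ROOT);

    public static void main(String[] args) {
        String a = NORMALIZE.apply("Egger-Lienz-Strasse, 10 AT-4050 Traun");
        String b = NORMALIZE.apply("Egger Lienz Strasse, 10 AT-4050 Traun");
        System.out.println(a);           // prints egger lienz strasse, 10 at 4050 traun
        System.out.println(a.equals(b)); // prints true
    }
}
```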

Language Supported

Do we have support for Chinese language (character) fuzzy match in this product?

Faster response is appreciated.

Nicknames / different name spellings

In using your software to match people's names, I'm wondering how best to handle nicknames (e.g., Jim and James, Bill and William) and different spellings of the same name that are phonetically identical (e.g., Chris / Kris). I'd actually expected Kris / Chris to match with soundex, but apparently they don't. Ideally Bill Smith would be a match for William Smith, and vice versa.
I could handle this by updating the PreProcessing or Tokenizer function, but I don't want to go re-inventing the wheel if you already have a better way of handling this, or plan to implement something soon.
Thanks,
Anton.
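Classic Soundex keeps the first letter of each word, which would explain why Chris (C620) and Kris (K620) receive different codes. A pre-processing step that rewrites known nicknames and spelling variants to one canonical form is one way around both problems. The sketch below is hypothetical and standalone: the three mappings are examples taken from the question, not the contents of the library's own dictionary.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of fuzzy-matcher.
class NicknameNormalizer {

    // Example mappings only; a real table would be much larger.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("bill", "william");
        CANONICAL.put("jim", "james");
        CANONICAL.put("kris", "chris");
    }

    // Lower-cases the name and replaces each word that has a canonical form.
    static String normalize(String fullName) {
        StringBuilder out = new StringBuilder();
        for (String word : fullName.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(CANONICAL.getOrDefault(word, word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Bill Smith"));    // prints william smith
        System.out.println(normalize("William Smith")); // prints william smith
    }
}
```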

Start with non spring project

My project is a non-Spring project. How can I include this dependency without messing up my code?

Exception thrown for dates prior to the epoch

Hi,
I'm encountering an exception when documents have a date element that is before or around the epoch (1970-01-01). This seems to be due to the TokenRange constructor converting dates to a numeric value based off Date.getTime() but not expecting negative values which can result in a range with lower > upper. I can dig further if helpful.

Exception seen:

Exception in thread "main" java.lang.IllegalArgumentException: fromKey > toKey
	at java.util.TreeMap$NavigableSubMap.<init>(TreeMap.java:1368)
	at java.util.TreeMap$AscendingSubMap.<init>(TreeMap.java:1855)
	at java.util.TreeMap.subMap(TreeMap.java:913)
	at java.util.TreeSet.subSet(TreeSet.java:325)
	at com.intuit.fuzzymatcher.component.TokenRepo$Repo.get(TokenRepo.java:80)
	at com.intuit.fuzzymatcher.component.TokenRepo.get(TokenRepo.java:36)
	at com.intuit.fuzzymatcher.component.ElementMatch.elementThresholdMatching(ElementMatch.java:35)
	at com.intuit.fuzzymatcher.component.ElementMatch.lambda$matchElement$1(ElementMatch.java:26)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.intuit.fuzzymatcher.component.ElementMatch.matchElement(ElementMatch.java:25)
	at com.intuit.fuzzymatcher.component.DocumentMatch.lambda$null$0(DocumentMatch.java:35)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
	at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at com.intuit.fuzzymatcher.component.DocumentMatch.lambda$matchDocuments$1(DocumentMatch.java:36)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at com.intuit.fuzzymatcher.component.MatchService.applyMatchByDocId(MatchService.java:115)
	at com.intuit.fuzzymatcher.component.MatchService.applyMatchByDocId(MatchService.java:81)
	...

Thanks,
Dave
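The suspected failure mode can be reproduced without the library. The sketch below assumes, purely for illustration, that the range bounds are computed by scaling the value by a percentage; the actual TokenRange code may differ. With a negative value (as Date.getTime() returns for dates before 1970-01-01) the scaled bounds swap places, and TreeSet.subSet rejects them with an IllegalArgumentException like the one in the stack trace. Deriving the bounds from an absolute delta keeps them ordered.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration, not the library's TokenRange.
class NegativeRange {

    // Assumed naive bounds: scaling a negative value flips their order.
    static double[] naive(double value, double pct) {
        return new double[] { value * (1 - pct), value * (1 + pct) };
    }

    // Bounds built from an absolute delta stay ordered for any sign.
    static double[] safe(double value, double pct) {
        double delta = Math.abs(value) * pct;
        return new double[] { value - delta, value + delta };
    }

    public static void main(String[] args) {
        TreeSet<Double> tokens = new TreeSet<>(Arrays.asList(-1200.0, -1000.0, -950.0));

        double[] ok = safe(-1000.0, 0.1);
        System.out.println(tokens.subSet(ok[0], true, ok[1], true)); // prints [-1000.0, -950.0]

        double[] bad = naive(-1000.0, 0.1);
        try {
            tokens.subSet(bad[0], true, bad[1], true); // lower bound is above upper bound
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```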

Fuzzy matching issue : only fetching the exact match

I am testing with the code below. Here FuzzyTitle is a user-defined class.

FuzzyTitle docSearch = new FuzzyTitle("1", "Match", "Max Studio", "Produced");
FuzzyTitle docAvailable1 = new FuzzyTitle("2", "Match", "Dream Work", "Produced");
FuzzyTitle docAvailable2 = new FuzzyTitle("3", "Match Rain", "Dream Work", "Released");

Document searchDocument = new Document.Builder(docSearch.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docSearch.getTitle()).createElement())
        .createDocument();

Document searchDocAvailable1 = new Document.Builder(docAvailable1.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docAvailable1.getTitle()).setWeight(.70D).createElement())
        .createDocument();

Document searchDocAvailable2 = new Document.Builder(docAvailable2.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docAvailable2.getTitle()).setWeight(.70D).createElement())
        .createDocument();

List documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);

MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);

Output:
Match : {{[{'Match'}]}=[Match{data={[{'Match'}]}, matchedWith={[{'Match'}]}, score=1.0}]}

Questions:

Ideally, 1 and 3 should also match, as they are similar, but only the exact match is returned. Am I doing something wrong here?
I want to distribute the weight as 70% to the 2nd parameter, 20% to the 3rd, and 10% to the 4th. Should I set the weights in all documents or only in the available documents?
In this case all parameters are of type ElementType.TEXT. Could giving every element the same TEXT type produce wrong results?

I have kept the example simple. Definitely, I am doing something wrong. Requesting for help and clarification please.
