
fuzzy-matcher's Introduction

Fuzzy-Matcher

Introduction

A Java-based library to match and group "similar" elements in a collection of documents.

Imagine working in a system with a collection of contacts and wanting to match and categorize contacts with similar names, addresses or other attributes. A fuzzy matching algorithm can help you do this. It can even help you find duplicate contacts, or prevent your system from adding duplicates.

This library can act on any domain object, such as a contact, and find similarity for various use cases. It dives deep into each character and estimates the probability that two or more objects are similar.

What's Fuzzy

The contacts "Steven Wilson" living at "45th Avenue 5th st." and "Stephen Wilkson" living at "45th Ave 5th Street" might look like belonging to the same person. It's easy for humans to ignore the small variance in spelling in names, or ignore abbreviation used in address. But for a computer program they are not the same. The string Steven does not equals Stephen and neither does Street equals st. If our trusted computers can start looking at each character and the sequence in which they appear, it might look similar. Fuzzy matching algorithms is all about providing this level of magnification to our myopic machines.

How does this work

Breaking down your data

This algorithm accepts data as a list of entities called Document (like a contact entity in your system), each of which can contain 1 or more Element (like names, addresses, emails, etc.). Internally each Element is further broken down into 1 or more Token, which are then matched using a configurable MatchType.

This combination of tokenizing the data and then matching the tokens can extract similarity from a wide variety of data types

Exact word match

Consider these Elements defined in two different Documents

  • Wayne Grace Jr.
  • Grace Hilton Wayne

With a simple tokenization process each word here can be considered a token, and if another element contains the same word, the elements are scored on the number of matching tokens. In this example the words Wayne and Grace match, which is 2 words out of the 3 total in each element. A scoring mechanism will match them with a result of 0.67
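To make the scoring concrete, here is a minimal plain-Java sketch of that idea (an illustration of the concept only, not the library's internal code):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordOverlapExample {
    public static void main(String[] args) {
        // Lower-cased word tokens of the two elements
        Set<String> first = new HashSet<>(Arrays.asList("wayne", "grace", "jr"));
        Set<String> second = new HashSet<>(Arrays.asList("grace", "hilton", "wayne"));

        // Count tokens that appear in both elements
        Set<String> common = new HashSet<>(first);
        common.retainAll(second); // {wayne, grace}

        // 2 matching tokens out of 3 total in each element -> ~0.67
        double score = (double) common.size() / Math.max(first.size(), second.size());
        System.out.println(score);
    }
}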

Soundex word match

Consider these Elements in two different Documents

  • Steven Wilson
  • Stephen Wilkson

Here we do not just look at each word, but encode it using Soundex, which gives a unique code for the phonetic spelling of the word. So in this example the words Steven & Stephen both encode to S315, whereas the words Wilson & Wilkson both encode to W425.

This allows both elements to match exactly, with a score of 1.0
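The library uses the Apache Soundex implementation; assuming the commons-codec dependency is on the classpath, the encoding step can be tried in isolation like this:

import org.apache.commons.codec.language.Soundex;

public class SoundexExample {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Phonetically similar words collapse to the same code
        System.out.println(soundex.encode("Steven"));  // S315
        System.out.println(soundex.encode("Stephen")); // S315
        System.out.println(soundex.encode("Wilson"));  // W425
        System.out.println(soundex.encode("Wilkson")); // W425
    }
}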

NGram token match

In cases where breaking down the Elements into words is not feasible, we split them using NGrams. Take emails, for example.

Here, if we ignore the domain name and take 3-character sequences (trigrams) of the data, the tokens will look like this

  • parker.james -> [par, ark, rke, ker, er., r.j, .ja, jam, ame, mes]
  • james_parker -> [jam, ame, mes, es_, s_p, _pa, par, ark, rke, ker]

Comparing these NGrams, 7 out of the 10 total tokens match exactly, which gives a score of 0.7
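A minimal sketch of how such trigrams can be produced (illustrative only; the library ships its own triGramTokenizer):

import java.util.ArrayList;
import java.util.List;

public class TriGramExample {
    // Slide a 3-character window over the value
    static List<String> triGrams(String value) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= value.length(); i++) {
            grams.add(value.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(triGrams("parker.james")); // [par, ark, rke, ker, er., r.j, .ja, jam, ame, mes]
        System.out.println(triGrams("james_parker")); // [jam, ame, mes, es_, s_p, _pa, par, ark, rke, ker]
    }
}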

Nearest Neighbors match

In certain cases breaking down elements into tokens and comparing the tokens is not an option, for example with numeric values, like dollar amounts in a list of transactions

  • 100.54
  • 200.00
  • 100.00

Here the first and third could belong to the same transaction, where the third differs only in precision. The match is done not on tokens being equal but on the closeness (the neighborhood range) in which the values appear. This closeness is again configurable; a 99% closeness, for example, will match them with a score of 1.0

A similar example can be thought of with Dates, where dates that are near each other might point to the same event.
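A rough sketch of how this looks with the library's builders, assuming the NUMBER element type and the setNeighborhoodRange setter described in the configuration sections below:

import com.intuit.fuzzymatcher.component.MatchService;
import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.Match;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import static com.intuit.fuzzymatcher.domain.ElementType.NUMBER;

public class AmountMatchExample {
    public static void main(String[] args) {
        double[] amounts = {100.54, 200.00, 100.00};
        List<Document> documents = new ArrayList<>();
        for (int i = 0; i < amounts.length; i++) {
            documents.add(new Document.Builder(String.valueOf(i + 1))
                    .addElement(new Element.Builder<Double>()
                            .setValue(amounts[i])
                            .setType(NUMBER)
                            .setNeighborhoodRange(0.99) // 99% closeness
                            .createElement())
                    .createDocument());
        }
        // Documents 1 and 3 (100.54 vs 100.00) are expected to match; 200.00 should stay unmatched
        Map<String, List<Match<Document>>> result = new MatchService().applyMatchByDocId(documents);
        System.out.println(result);
    }
}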

Four Stages of Fuzzy Match

[Figure: Fuzzy Match - the four stages]

We spoke in detail about Token and MatchType, which are the core of fuzzy matching, and touched upon Scoring, which gives the measure of how similar the data is. PreProcessing your data is a simple yet powerful mechanism that helps you start with clean data before running a match. These 4 stages, which are highly customizable, can be used to tune the matching for a wide variety of data types

  • Pre-Processing : This accepts a java Function, which allows you to develop the pre-processing functionality externally and pass it to the library, or to use one of the existing functions (see the sketch after this list). These are a few examples that are already available

    • Trim: Removes leading and trailing spaces (applied by default)
    • Lower Case: Converts all characters to lowercase (applied by default)
    • Remove Special Chars : Removes all characters except alphanumeric characters and spaces (default for TEXT type)
    • Numeric: Strips all non-numeric characters. Useful for numeric values like phone numbers or SSNs (default for NUMBER type)
    • Email: Strips the domain from an email address. This prevents common domains like gmail.com and yahoo.com from being considered in a match (default for EMAIL type)
  • Tokenization : This again accepts a Function, so it can be defined externally and fed to the library. Some commonly used tokenizers are already available.

    • Word : Breaks down an element into words (anything delimited by a space " ").
    • N-Gram : Breaks down an element into 3-letter grams.
    • Word-Soundex : Breaks an element down into words (space delimited) and encodes each word using the Apache Soundex library
    • Value : Nothing to break down here; the element value itself is used as the token. Useful for Nearest Neighbor matches
  • Match Type : Allows 2 types of matches, which can be applied to each Element

    • Equality: Uses exact matches with token values.
    • Nearest Neighbor: Finds tokens that are contained in the neighborhood range, which can be specified as a probability (0.0 - 1.0) for each element. It defaults to 0.9
  • Scoring : These are defined for Element and Document matches

    • Element scoring: Uses a simple average, where for each element the number of matching tokens is divided by the total number of tokens. A configurable threshold can be set for each element, beyond which the elements are considered a match (default 0.3)
    • Document scoring: A similar approach, where the number of matching elements is compared with the total number of elements. In addition, each element can be given a weight, which is useful when some elements in a document are considered more significant than others. A threshold can also be specified at the document level (defaults to 0.5), beyond which the documents are considered a match
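As a rough illustration of plugging in your own functions (a sketch only: the setter names follow the overrides listed under Element Configuration below, the tokenizer helpers are the ones shown elsewhere on this page, and the exact generic signature of the pre-processing Function is an assumption):

import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.ElementType;
import com.intuit.fuzzymatcher.function.TokenizerFunction;

public class CustomFunctionsExample {
    public static void main(String[] args) {
        Element<String> street = new Element.Builder<String>()
                .setType(ElementType.ADDRESS)
                .setValue("Egger-Lienz-Strasse 10")
                // Hypothetical pre-processing step: treat hyphens as spaces before matching
                .setPreProcessingFunction(value -> value.toString().replace('-', ' ').toLowerCase())
                // Swap the ADDRESS default (Soundex-encoded words) for plain word tokens
                .setTokenizerFunction(TokenizerFunction.wordTokenizer())
                .createElement();
        System.out.println(street);
    }
}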

End User Configuration

All the configurable options defined above can be applied at various points in the library.

Predefined Element Types

Below is the list of predefined Element Types, available with sensible defaults. These can be overridden via setters while creating an Element.

Element Type   PreProcessing Function     Tokenizer Function               Match Type
NAME           namePreprocessing()        wordSoundexEncodeTokenizer()     EQUALITY
TEXT           removeSpecialChars()       wordTokenizer()                  EQUALITY
ADDRESS        addressPreprocessing()     wordSoundexEncodeTokenizer()     EQUALITY
EMAIL          removeDomain()             triGramTokenizer()               EQUALITY
PHONE          numericValue()             decaGramTokenizer()              EQUALITY
NUMBER         numberPreprocessing()      valueTokenizer()                 NEAREST_NEIGHBORS
DATE           none()                     valueTokenizer()                 NEAREST_NEIGHBORS
AGE            numberPreprocessing()      valueTokenizer()                 NEAREST_NEIGHBORS

Note: Since each element is unique in the way it should match, if you need to match a different element type than what is supported, please open a new GitHub Issue, and the community will provide support and enhancements to this library

Document Configuration

  • Key: A required field indicating the unique primary key of the document
  • Elements: Set of elements for each document
  • Threshold: A double value between 0.0 - 1.0, above which the document is considered a match.

Element Configuration

  • Value : The string representation of the value to match
  • Type : One of the predefined element types, which applies the relevant functions for "PreProcessing", "Tokenization" and "MatchType"
  • Variance: (Optional) Differentiates elements of the same type in a document, e.g. a document containing 2 NAME elements, one for "user" and one for "spouse"
  • Threshold: A double value between 0.0 - 1.0, above which the element is considered a match.
  • Weight: A value applied to an element to increase or decrease the document score. The default is 1.0; any value above that will increase the document score when that element matches.
  • PreProcessingFunction: Overrides the PreProcessingFunction defined by Type
  • TokenizerFunction: Overrides the TokenizerFunction defined by Type
  • MatchType: Overrides the MatchType defined by Type
  • NeighborhoodRange: Relevant only for the NEAREST_NEIGHBORS MatchType. Defines how close the Value should be to be considered a match. Accepted values between 0.0 - 1.0 (defaults to 0.9)
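Putting several of these options together, a configured document might look roughly like this (a sketch; the values are illustrative and the setter names mirror the options listed above):

import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Element;

import static com.intuit.fuzzymatcher.domain.ElementType.ADDRESS;
import static com.intuit.fuzzymatcher.domain.ElementType.NAME;

public class ElementConfigExample {
    public static void main(String[] args) {
        Document contact = new Document.Builder("contact-1")      // Key
                .addElement(new Element.Builder<String>()
                        .setType(NAME)                            // applies the NAME defaults
                        .setVariance("user")                      // distinguishes this from a second NAME element
                        .setValue("Steven Wilson")
                        .setThreshold(0.5)                        // element-level threshold
                        .setWeight(2.0)                           // weigh the name more heavily in the document score
                        .createElement())
                .addElement(new Element.Builder<String>()
                        .setType(ADDRESS)
                        .setValue("45th Avenue 5th st.")
                        .createElement())
                .setThreshold(0.6)                                // document-level threshold
                .createDocument();
        System.out.println(contact);
    }
}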

Match Service

The MatchService supports 3 ways to match documents

  • Match a list of Documents: This is useful if you have an existing list of documents, and want to find out which of them might have potential duplicates. A typical de-dup use case
matchService.applyMatchByDocId(List<Document> documents)
  • Match a list of Documents with an Existing List: This is useful for matching a new list of documents with an existing list in your system. For example, if you're performing a bulk import and want to find out if any of them match with existing data
matchService.applyMatchByDocId(List<Document> documents, List<Document> matchWith)
  • Match a Document with an Existing List: This is useful when a new document is being created and you want to ensure that a similar document does not already exist in your system (see the sketch after this list)
matchService.applyMatchByDocId(Document document, List<Document> matchWith)
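For instance, the third form could back a duplicate check when a new contact is created (a sketch; the helper name and wiring are illustrative):

import com.intuit.fuzzymatcher.component.MatchService;
import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Match;

import java.util.List;
import java.util.Map;

public class DedupCheckExample {
    private static final MatchService MATCH_SERVICE = new MatchService();

    // Returns true if the new document matches anything already in the system
    static boolean hasPotentialDuplicate(Document newDocument, List<Document> existingDocuments) {
        Map<String, List<Match<Document>>> result =
                MATCH_SERVICE.applyMatchByDocId(newDocument, existingDocuments);
        return !result.isEmpty();
    }
}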

Match Results

The response of the library is essentially a Match<Document> object. It has 3 attributes

  • Data: This is the source Document on which the match is applied
  • MatchedWith: This is the target Document that the data matched with
  • Result: This is the probability score between 0.0 - 1.0 indicating how similar the 2 documents are

The response is grouped by Data.key, so from any of the MatchService methods the response is a map

Map<String, List<Match<Document>>>

Quick Start

Maven Import

The library is published to Maven Central

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>1.2.1</version>
</dependency>

(Note: This requires Java 11. For Java 8 use version 1.1.x)

Input

This library takes a collection of Document objects with various Elements as input.

For example, suppose you have multiple contacts as simple String arrays

String[][] input = {
        {"1", "Steven Wilson", "45th Avenue 5th st."},
        {"2", "John Doe", "546 freeman ave"},
        {"3", "Stephen Wilkson", "45th Ave 5th Street"}
};

Convert them into a List of Document

List<Document> documentList = Arrays.asList(input).stream().map(contact -> {
    return new Document.Builder(contact[0])
            .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME).createElement())
            .addElement(new Element.Builder<String>().setValue(contact[2]).setType(ADDRESS).createElement())
            .createDocument();
}).collect(Collectors.toList());

Applying the Match

The entry point for running this program is the MatchService class. Create a new instance of MatchService, and use the applyMatch methods to find matches

MatchService matchService = new MatchService();
Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

Output

This prints the results to the console. It should show a match between the 1st and 3rd documents, but not the 2nd.

result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});

Performance

Most real-life data sets are not as small or as simple as the examples shown here.
Since this library can be used to match elements against a large set of records, knowing how it performs is essential.

The performance characteristics vary primarily with the MatchType being used

  • EQUALITY - For an equality match, which is the default for most Element Types, the performance is linear, O(N), where N is the number of Elements across all documents.

  • NEAREST_NEIGHBOR - The default for numeric and date Element Types; the performance is O(N log N). This also depends on the NeighborhoodRange setting: the higher the value, the better it will perform. It is advisable not to use 1.0 as a NeighborhoodRange; instead, override the MatchType to EQUALITY (see the sketch below), which guarantees linear performance.
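If exact matching on a numeric or date element is what you need, that override might look roughly like this (a sketch; it assumes a setMatchType setter and a MatchType enum corresponding to the MatchType override listed under Element Configuration):

import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.MatchType;

import static com.intuit.fuzzymatcher.domain.ElementType.NUMBER;

public class EqualityOverrideExample {
    public static void main(String[] args) {
        // Force exact matching on an otherwise NEAREST_NEIGHBORS element to keep performance linear
        Element<Double> amount = new Element.Builder<Double>()
                .setType(NUMBER)
                .setValue(100.00)
                .setMatchType(MatchType.EQUALITY)
                .createElement();
        System.out.println(amount);
    }
}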

The following chart shows the performance characteristics of this library as the number of elements increases. As you can see, the library maintains near-linear performance and can match thousands of elements within seconds on a multi-core processor.

[Figure: performance chart]

fuzzy-matcher's People

Contributors

dependabot[bot], gabe2001, inspire99, jdfalko, manishobhatia, mayurmadnani, naudzghebre


fuzzy-matcher's Issues

Domains: how to implement a new/different domain

Hi,
excellent library and I'd love to apply the functionality to other domains. The current code uses address details. As far as I can tell the Element classes would have to be rewritten to accommodate another domain.
Am I missing something?
cheers,
-gabe

More documentation on how to use

Hi @manishobhatia
I like the library and the documentation. But I was hoping you could provide some more information on how to use it in a project (like Spring Boot). As I am new to Java, I am having a bit of a problem with the implementation.

SLF4J Failed to load

Your software looks fantastic. We will be using it for managing our existing sales leads database and verifying uniqueness of future entries.
The program worked with the sample data but generated a warning regarding the logger. Is there a fix for this? I plan to utilize this application a lot. Thank you for your efforts!

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Data: {[{'Steven Wilson'}, {'45th Avenue 5th st.'}]} Matched With: {[{'Stephen Wilkson'}, {'45th Ave 5th Street'}]} Score: 1.0000000000000002
Data: {[{'Stephen Wilkson'}, {'45th Ave 5th Street'}]} Matched With: {[{'Steven Wilson'}, {'45th Avenue 5th st.'}]} Score: 1.0000000000000002

Error on tests

Hi !

I am getting this error when trying to install

Tests run: 81, Failures: 0, Errors: 8, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.490 s
[INFO] Finished at: 2019-09-03T17:25:48-03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project fuzzy-matcher: There are test failures.
[ERROR]
[ERROR] Please refer to c:\Users\Luccas Klotz\Downloads\fuzzy-matcher-master\target\surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

If I go to \fuzzy-matcher-master\target\surefire-reports, the errors are in tests where files such as demo.csv and test-data.csv could not be found, but the files are there.

Any clues?

Thanks !

Threshold matching not changing for ElementType: DATE

I am trying to understand how to change the threshold matching for a set of 3 dates: "07/15/2019", "01/01/2020", and "01/02/2020". The matching score always comes back as 1.0 for all three dates. I have tried changing the threshold at the Element level and the Document level and it doesn't make a difference. How can I change the matching so that a date of 07/15/2019 does not match 01/01/2020 with the same score that 01/01/2020 matches 01/02/2020?
Here is my sample code. It is similar to the JUnit test I found in your project, but modified for a Spring Boot application.

@Component
public class FuzzylogicRunner implements CommandLineRunner {


    @Override
    public void run(String... args) {
        MatchService matchService = new MatchService();

        List<Object> dates = Arrays.asList(getDate("07/15/2019"), getDate("01/01/2020"), getDate("01/02/2020"));
        List<Document> documentList1 = getTestDocuments(dates, DATE, null);
        Map<Document, List<Match<Document>>> result1 = matchService.applyMatch(documentList1);

        result1.forEach((key, value) -> value.forEach(match -> {
            System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
        }));
    }

    private List<Document> getTestDocuments(List<Object> values, ElementType elementType, Double neighborhoodRange) {
        AtomicInteger ai = new AtomicInteger(0);
        return values.stream().map(num -> {
            Element.Builder elementBuilder = new Element.Builder().setType(elementType).setValue(num).setThreshold(0.1);
            if (neighborhoodRange != null) {
                elementBuilder.setNeighborhoodRange(neighborhoodRange);
            }
            return new Document.Builder(Integer.toString(ai.incrementAndGet()))
                    .addElement(elementBuilder.createElement()).setThreshold(0.1)
                    .createDocument();
        }).collect(Collectors.toList());
    }

    private Date getDate(String val) {
        DateFormat df = new SimpleDateFormat("MM/dd/yyyy");
        try {
            return df.parse(val);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return null;
    }
}

The output is always the following (even if I modify the threshold from 0.1 to 0.9):
Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0

Full name matching

Quite new here..

I'm looking into using fuzzy-matcher in a sanction list monitoring service (will share on GitHub once ready), but I'm facing an issue figuring out the best way to address full names

Example:
Sanction lists provide a name such as "first last", but we are matching against "first middle1 middle2 last".

What is the best approach to match the latter to the former? In other words, ignore "middle1 middle2" from the query? I can programmatically tokenize the search query and run it for various combinations. Furthermore, in the case of a middle initial (I saw an earlier issue that was not resolved), can a weight be given to the initial?

For example "First M Last" should get a higher score compared to "First Last".

Any feedback would be greatly appreciated.

Though there is a matching result, the matcher is not returning it.

Hi Manish,
I am trying to make use of fuzzy-matcher, for which I have written a sample test class (the code is appended below).
I am surprised that the system is not returning the matching record. I am not able to figure out what has been done wrong.

could you please help?
Amarjeet

// Java code start here
package test;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.intuit.fuzzymatcher.component.MatchService;
import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.ElementType;
import com.intuit.fuzzymatcher.domain.Match;

public class SampleTest {

public static void main(String[] args) {
	MatchService matchService = new MatchService();
	SampleTest test = new SampleTest();
	Map<Document, List<Match<Document>>> map;
    
	
	List<Document> matchDocument = test.loadList();
	
	Customer searchCust = new Customer();
	searchCust.setId(1);
	searchCust.setfName("Amarjeet");
	searchCust.setmName("Rampher");
	searchCust.setlName("Tiwari");
	searchCust.setAadhar("123456789012");
	//searchCust.setAddress("25 avenue");
	//searchCust.setPan("asdf345ghb");
	Document searchDocument = new Document.Builder(""+searchCust.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getAadhar()).createElement())
            //.addElement(new Element.Builder().setType(ElementType.TEXT).setValue(searchCust.getPan()).createElement())
            .createDocument();
	
	List<Document> documentList = new ArrayList<>();
    documentList.add(searchDocument);
    
    map = matchService.applyMatch(matchDocument, documentList);
    System.out.println("Match : " + map);
	Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);	 
	System.out.println("Match : " + result);
	
	
}


public class MatchServiceFactory {
	
    private MatchService matchService = new MatchService();

    public MatchService getMatchService() {
        return this.matchService;
   }
}

//Load the list 
public List<Document> loadList() {
	Customer cust = new Customer();
	Customer cust1 = new Customer();
	Customer cust2 = new Customer();
	Customer cust3 = new Customer();
	Customer cust4 = new Customer();
	Customer cust5 = new Customer();
	Customer cust6 = new Customer();
	Customer cust7 = new Customer();
	Customer cust8 = new Customer();
	Customer cust9 = new Customer();
	Customer cust10 = new Customer();
	cust.setId(1);
	cust.setfName("Amarjeet");
	cust.setmName("Rampher");
	cust.setlName("Tiwari");
	cust.setAadhar("123456789012");
	cust.setAddress("137 hopkins ave jersey city");
	cust.setPan("adopt8451z");
	
	cust1.setId(2);
	cust1.setfName("fname2");
	cust1.setmName("mname2");
	cust1.setlName("lname3");
	cust1.setAadhar("12233434");
	cust1.setAddress("25 avenue2");
	cust1.setPan("asdf345ghb2");

	cust2.setId(3);
	cust2.setfName("fname3");
	cust2.setmName("mname3");
	cust2.setlName("lname3");
	cust2.setAadhar("122334343");
	cust2.setAddress("25 avenue3");
	cust2.setPan("asdf345ghb3");	
	
	Document searchDocument = new Document.Builder(""+cust.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust.getPan()).createElement())
            .createDocument();

	Document searchDocument1 = new Document.Builder(""+cust1.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust1.getPan()).createElement())

            .createDocument();

	Document searchDocument2 = new Document.Builder(""+cust2.getId())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getfName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getmName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getlName()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getAadhar()).createElement())
            .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(cust2.getPan()).createElement())
            .createDocument();
	    
	List<Document> documentList = new ArrayList<>();
    documentList.add(searchDocument);
    documentList.add(searchDocument1);
    documentList.add(searchDocument2);
    return documentList;
}

}

class Customer{
	int id;
	String fName;
	String mName;
	String lName;
	String address;
	String pan;
	String aadhar;
	
	public int getId() {
		return id;
	}
	public void setId(int id) {
		this.id = id;
	}
	public String getfName() {
		return fName;
	}
	public void setfName(String fName) {
		this.fName = fName;
	}
	public String getmName() {
		return mName;
	}
	public void setmName(String mName) {
		this.mName = mName;
	}
	public String getlName() {
		return lName;
	}
	public void setlName(String lName) {
		this.lName = lName;
	}
	public String getAddress() {
		return address;
	}
	public void setAddress(String address) {
		this.address = address;
	}
	public String getPan() {
		return pan;
	}
	public void setPan(String pan) {
		this.pan = pan;
	}
	public String getAadhar() {
		return aadhar;
	}
	public void setAadhar(String aadhar) {
		this.aadhar = aadhar;
	}
}

// Java code ends here

Exception thrown for dates prior to the epoch

Hi,
I'm encountering an exception when documents have a date element that is before or around the epoch (1970-01-01). This seems to be due to the TokenRange constructor converting dates to a numeric value based on Date.getTime() but not expecting negative values, which can result in a range with lower > upper. I can dig further if helpful.

Exception seen:

Exception in thread "main" java.lang.IllegalArgumentException: fromKey > toKey
	at java.util.TreeMap$NavigableSubMap.<init>(TreeMap.java:1368)
	at java.util.TreeMap$AscendingSubMap.<init>(TreeMap.java:1855)
	at java.util.TreeMap.subMap(TreeMap.java:913)
	at java.util.TreeSet.subSet(TreeSet.java:325)
	at com.intuit.fuzzymatcher.component.TokenRepo$Repo.get(TokenRepo.java:80)
	at com.intuit.fuzzymatcher.component.TokenRepo.get(TokenRepo.java:36)
	at com.intuit.fuzzymatcher.component.ElementMatch.elementThresholdMatching(ElementMatch.java:35)
	at com.intuit.fuzzymatcher.component.ElementMatch.lambda$matchElement$1(ElementMatch.java:26)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.intuit.fuzzymatcher.component.ElementMatch.matchElement(ElementMatch.java:25)
	at com.intuit.fuzzymatcher.component.DocumentMatch.lambda$null$0(DocumentMatch.java:35)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
	at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at com.intuit.fuzzymatcher.component.DocumentMatch.lambda$matchDocuments$1(DocumentMatch.java:36)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at com.intuit.fuzzymatcher.component.MatchService.applyMatchByDocId(MatchService.java:115)
	at com.intuit.fuzzymatcher.component.MatchService.applyMatchByDocId(MatchService.java:81)
	...

Thanks,
Dave

Questions

Thanks for creating this library! I have not looked into the code in detail, but would like to ask a few questions.

  • It looks like the "soundex" functionality from Apache is heavily biased toward the English language only. Is that correct?

  • Do you have any estimates of the performance/complexity of the algorithms involved? I am considering this library for searching/iterating through in-memory data objects that have no other indexing.

Matching On Single Word

Hello, using the documented example, if I search on "Stephen" instead of "Stephen Wilkson", I get no matches returned. Is there a way to search so that a single word is matched to elements that contain multiple words?

Fuzzy matching issue : only fetching the exact match

I am testing with the code below. Here FuzzyTitle is a user-defined class.

FuzzyTitle docSearch = new FuzzyTitle("1", "Match", "Max Studio", "Produced");
FuzzyTitle docAvailable1 = new FuzzyTitle("2", "Match", "Dream Work", "Produced");
FuzzyTitle docAvailable2 = new FuzzyTitle("3", "Match Rain", "Dream Work", "Released");

Document searchDocument = new Document.Builder(docSearch.getCounter()).addElement(new Element.Builder().setType(ElementType.TEXT).setValue(docSearch.getTitle()).createElement()) .createDocument();

Document searchDocAvailable1 = new Document.Builder(docAvailable1.getCounter()) .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(docAvailable1.getTitle()).setWeight(.70D).createElement()).createDocument();

Document searchDocAvailable2 = new Document.Builder(docAvailable2.getCounter()).addElement(new Element.Builder().setType(ElementType.TEXT).setValue(docAvailable2.getTitle()).setWeight(.70D).createElement()).createDocument();

List documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);

MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);

OutPut :
Match : {{[{'Match'}]}=[Match{data={[{'Match'}]}, matchedWith={[{'Match'}]}, score=1.0}]}

Query :

  1. Ideally 1 and 3 should also match, as both of them have similarity. Only the exact match is coming back. Am I doing something wrong here?
  2. I want to distribute the weight as 70% for the 2nd parameter, 20% for the 3rd, and 10% for the 4th. Should I place the weight in all docs or only in the available docs?
  3. As all parameters in this case are of type ElementType.TEXT, will putting all element types as the same TEXT trigger any wrong results?

I have kept the example simple. I am definitely doing something wrong. Requesting help and clarification, please.

IllegalStateException: stream has already been operated upon or closed

I got the following exception when attempting to match a Document.

Caused by: java.lang.IllegalStateException: stream has already been operated upon or closed
	at java.util.stream.AbstractPipeline.<init>(AbstractPipeline.java:203) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.<init>(ReferencePipeline.java:94) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$StatelessOp.<init>(ReferencePipeline.java:618) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2.<init>(ReferencePipeline.java:163) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.filter(ReferencePipeline.java:162) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.domain.Token.getSearchGroups(Token.java:55) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$null$12(MatchOptimizerFunction.java:177) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_181]
	at java.util.stream.DistinctOps$1$2.accept(DistinctOps.java:175) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_181]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$initializeSearchGroups$13(MatchOptimizerFunction.java:171) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_181]
	at java.util.stream.DistinctOps$1$2.accept(DistinctOps.java:175) ~[na:1.8.0_181]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1625) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_181]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_181]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_181]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.initializeSearchGroups(MatchOptimizerFunction.java:166) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.function.MatchOptimizerFunction.lambda$searchGroupOptimizer$8(MatchOptimizerFunction.java:148) ~[fuzzy-matcher-0.4.2.jar:na]
	at com.intuit.fuzzymatcher.component.TokenMatch.lambda$matchTokens$1(TokenMatch.java:26) ~[fuzzy-matcher-0.4.2.jar:na]
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[na:1.8.0_181]
	at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1696) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_181]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_181]
	at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747) ~[na:1.8.0_181]
	at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721) ~[na:1.8.0_181]
	at java.util.stream.AbstractTask.compute(AbstractTask.java:316) ~[na:1.8.0_181]
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool$WorkQueue.execLocalTasks(ForkJoinPool.java:1040) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1058) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_181]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[na:1.8.0_181]

New Element Type for product names

I've been trying to compare model names of different machines.
I played around with the configuration of my elements but I'm not getting good results.

I would like to get a high match score if one string contains a similar, smaller string

STP375S-B60/Wnh_1500V_20V02_1756
STP 375S-B60/Wnh

If there is a configuration that would produce my desired score that would be best, but if not, then I suggest adding a new Element type "Product Designation" that could compare strings like "M80 PDF C" sensibly.

Trying to match names with initials

Hi,
I am trying to match names such as A Mathur with Abhishek Mathur, or Donald Trump with D Trump. Is there some simple parameter I can adjust to allow that?

Thanks
Abhi

Is there any way to create my own matchers?

Hi @manishobhatia,

I couldn't find a way to create my own matcher trying to solve the following problem:

I have document numbers and I would like to compare them with others by the digit they share. For instance, document A = 12305211C LPZ, document B = 12105321C CBA.
The matching score is 1.0

Explanation:

Input A:
After preprocessing: 12305211, (deleting non-digit chars)
After tokenization: int[] occurrencesA, for instance occurrencesA[1] is 3

Input B:
After preprocessing: 12105321C, (deleting non-digit chars)
After tokenization: int[] occurrencesB, for instance occurrencesB[1] is 3

Matching Function A with B: (number of occurrences) / (length of A).

I would appreciate it if you have any workaround or approach to solve the problem with the library.

Thank you very much,
I wish you the best

Language Supported

Do we have support for Chinese language (character) fuzzy matching in this product?

A faster response would be appreciated.

Phone number assumed to be a US number

Hi @manishobhatia,
In the pre-processing steps for a phone number, the country code '1' gets added to the number, making it specific to the US.
I would like to work on a fix for this. Here's what I have in mind: as part of normalization, do not add the country code but rather remove it if present, so that the phone number match happens without the country code.

Help

Sorry if I ask, but I want to learn how to build libraries like this and I don't understand how to test your project. Really sorry to bother you.

Matching two strings

Hi,

I came across this library and tried a very elementary test: matching two strings to find the score.
After going through the readme file, several null pointer exceptions, and issues already raised, I was finally able to get something working by doing this:
List<String> input = Arrays.asList(firstStr, secondStr);

	Document document = null;
	List<Document> documentList = new ArrayList<>();
	for (String str : input) {
		document = new Document.Builder(str)
				.addElement(new Element.Builder<String>().setValue(str).setType(ElementType.TEXT).createElement())
				.createDocument();
		documentList.add(document);
	}
	Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

	result.entrySet().forEach(entry -> {
		entry.getValue().forEach(match -> {
			System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: "
					+ match.getScore().getResult());
		});
	});

However, I found that:

  1. The strings match each other twice, i.e. firstStr matches secondStr and then vice-versa, while only a single match was required.
  2. I tried a string with numbers like "Nayan J Bayan 123" with elementType TEXT, and the result came out as a blank map. What should the element type be if I want to check strings containing numbers and special characters? I also tried with the ADDRESS type and the same blank map was returned.
  3. I don't think my implementation is correct at all. What should the correct implementation have been?

Thanks

Information on Library usage

Hi,
Can I use this library for matching payment and invoice records?
Let's say on the payment side I have 5-6 properties: { invoice-number, invoice-amount, customer-name, invoice-date, customer-email }
Note: These properties are extracted for payments from the advice document we receive.
The same properties will be available for the invoice as well: { invoice-number, invoice-amount, customer-name, invoice-date, customer-email }

Can I use this library to find which invoice is the most suitable match for the payment?

For example, let's consider there is a payment and 6 invoices.

Will I be able to score all the invoices against a payment?

payment 1 - invoice 1 - score is 0.9
payment 1 - invoice 2 - score is 0.8
payment 1 - invoice 3 - score is 0.7
payment 1 - invoice 4 - score is 0.6
payment 1 - invoice 5 - score is 0.65
payment 1 - invoice 6 - score is 0.77

Date Match with NeighborhoodRange greater than 0.91 fails to give valid results

The DATE element type allows the user to override the NeighborhoodRange, but a value greater than 0.91 causes poor matches to show up.

Here is a test case in MatchServiceTest.java that shows the failure

@Test
    public void itShouldApplyMatchWithDate() {
        List<Object> dates = Arrays.asList(getDate("01/01/2020"), getDate("01/02/2020"), getDate("07/15/2019"));
        List<Document> documentList = getTestDocuments(dates, DATE, 0.91);
        Map<Document, List<Match<Document>>> result = matchService.applyMatch(documentList);
        result.entrySet().forEach(entry -> {
            entry.getValue().forEach(match -> {
                System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
            });
        });

        Assert.assertEquals(2, result.size());
    }

As we increase the value above 0.91, dates that are not in the neighborhood show up in the results.

The issue is primarily the incorrect usage of this constant:

private static final double DATE_SCALE_FACTOR = 1.1;

It increases the TokenRange's lower and upper bounds to broader values, causing incorrect matches to show up.

Support for configuring abbreviated names

We have a situation where regular abbreviations of names do not match the proper name, and we are not sure if that is supported or not.

For example:

'Barney' - an abbreviation of 'Barnaby' will not match with 'Barnaby'.

If this is not currently supported, I was wondering if it's worth considering supporting it by allowing an app to supply a map of abbreviations to known names, so that if, say, all other matching fails and any NAME-type components of the match document contain abbreviations in the map, a substitution could be made with the proper name and the match reattempted.

Address matching: street containing hyphens

Currently these addresses do not match at all:
Case 1:

  • Egger-Lienz-Strasse, 10 AT-4050 Traun
  • Egger Lienz Strasse, 10 AT-4050 Traun

Case 2:

  • Erdl 5a AT-4861 Schörfling am Attersee
  • Erdl 5b AT-4861 Schörfling/Attersee

Is there any option to improve the algorithm so that space and hyphen are treated as equal separators?

Regards,
Andrei.

Combine Tokenizers for better results

Hi, I had problems using a single tokenizer for matching names.
The wordSoundexEncodeTokenizer was matching two different names as equal: I was matching "Caputo" and the MatchService returned "Caputo" and "Chabot" with equal scores.
The wordTokenizer was skipping "Nikolau" when the correct match was "Nikolaou".
The triGramTokenizer was skipping "Leao" when there was a direct match with "Rafael Leao".

I found a temporary solution concatenating the Tokenizers with a custom method:

@SafeVarargs
    public static <T> Function<Element<T>, Stream<Token<T>>> concatTokenizers(Function<Element<T>, Stream<Token<T>>>... funct) {
        return element -> Arrays.stream(funct).flatMap(fun -> fun.apply(element));
    }

and using it like

                .setTokenizerFunction(concatTokenizers(
                        TokenizerFunction.wordTokenizer(),
                        TokenizerFunction.wordSoundexEncodeTokenizer(),
                        TokenizerFunction.triGramTokenizer()
                ))

I'm not sure if this is the correct approach, but I hope the function is helpful to others with the same problem.
If the solution is correct I would like to have it included in the library.

The results after using the function were as expected and all items matched perfectly, but I would suggest further development: if possible, a weight given to each tokenizer, or listing the tokenizers in order so that when one gives no results the next one is used, to prioritize exact matches over like-sounding ones that may have the same score.

Comparing two strings with different lengths

I have two strings to compare, but one is too long, so I get no match:
{"jojo","le-bizzarre-avventure-di-jojo-diamond-is-unbreakable"}

I'm using this matching method
matchService.applyMatchByGroups(documentList)
with this type of document

   for (String str : input) {
                document2 = new Document.Builder(str)
                        .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.NAME).createElement())
                        .setThreshold(0)
                        .createDocument();
                documentList.add(document2);
            }

Find the most similar array

Hi,

Say I have a list of sentences, Collection<String> sentences, and I want to simply find the most similar among these to another sentence. Is this possible?

I read the README file, and the example there only ingests one input (documents).

Thanks !

Nicknames / different name spellings

In using your software to match people's names, I'm wondering how best to handle nicknames (e.g., Jim and James, Bill and William) and different spellings of the same name that are phonetically identical (e.g., Chris / Kris). I'd actually expected Kris / Chris to match with Soundex, but apparently not. Ideally Bill Smith would be a match for William Smith, and vice versa.
I could handle this by updating the PreProcessing or Tokenizer function, but I don't want to reinvent the wheel if you already have a better way of handling this, or plan to implement something soon.
Thanks,
Anton.

Support for Age Element type

On similar lines to the NUMBER element, can we have an AGE element?
The expectation here is that a user's age can change by a small amount if captured at different times.

So we should see a higher score if the age values are closer (around 1 or 2 years difference), getting progressively lower, with a score of 0 for an age difference over 5 years.

Upgrade to Java 11

Java 8 is nearing EOL; we are looking to upgrade this project to support Java 11.

  • Identify dependent libraries' support for Java 11 and upgrade them if necessary
  • Unit tests should pass, and no significant performance impact should be seen

Aggregate strings based on similarity into groups.

Is there a way to aggregate Strings into groups based on similarity? I have a list of strings List<String> strings and I need to group them.

MatchService provides applyMatchByGroups, which is very close, but the inner Set contains every match for the group instead of only the unique values, which is what I need.

Thanks in advance for the answer.

Inconsistency in matching results

Hi there!

I am running some matching tests with your library on a large dataset of bank transactions. The documents used for matching have a single TEXT field containing the description of the transaction. On running the matching, I found that some strings that should be an exact match come back with a probability as low as 0.5. See the sample below:

{[{'Immediate Payment Fee,'}]}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 1.0}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}
Matching {{[{'Immediate Payment Fee,'}]}, 0.6666666666666666}

The top line is the document that is matched and the other lines are the matching results. The implementation of my test is as straightforward as can be, I believe:

            // Build document input
            List<Document> documents = new ArrayList<>();
            transactions.stream().forEach(t -> {

                documents.add(new Document.Builder(String.valueOf(index.incrementAndGet()))
                    .addElement(new Element.Builder().setType(ElementType.TEXT).setValue(t.getDescription()).createElement())
                    .createDocument());
            });

            MatchService matchService = new MatchService();
            Map<Document, List<Match<Document>>> matched = matchService.applyMatch(documents);

            for (Iterator<Document> i = matched.keySet().iterator(); i.hasNext(); ) {

                Document d = i.next();
                System.out.println("");
                System.out.println(d.toString());
                matched.get(d).stream().forEach(doc -> {

                    System.out.println("Matching " + doc.toString());
                });
            }

I'd love to use your library for my project and hope to contribute, but I think I first need to understand why not all matches are being returned as 1.0 when all the strings are identical.

Thanks!

Support for distinct groups of similar strings

Here is the scenario for input like:

String[][] input = {
    {"1", "Nike"},
    {"2", "Puma"},
    {"3", "Niket dhaka"},
    {"4", "Levi's"},
    {"5", "Levi's Fashion for Men and Women"},
    {"6", "Nike"},
    {"7", "Puma Sports and Fitness"},
    {"8", "Puma Shoes"},
    {"9", "Fashion Nova"},
    {"10", "H&M Fashion"},
    {"11", "Nike Sports"}
};
The groups formed are something like this:

[
  { "0": {"id": 11, "name": "Nike Sports"}, "1": {"id": 1, "name": "Nike"}, "2": {"id": 6, "name": "Nike"} },
  { "0": {"id": 1, "name": "Nike"}, "1": {"id": 6, "name": "Nike"}, "2": {"id": 11, "name": "Nike Sports"} },
  { "0": {"id": 2, "name": "Puma"}, "1": {"id": 8, "name": "Puma Shoes"} },
  { "0": {"id": 6, "name": "Nike"}, "1": {"id": 1, "name": "Nike"}, "2": {"id": 11, "name": "Nike Sports"} },
  { "0": {"id": 8, "name": "Puma Shoes"}, "1": {"id": 2, "name": "Puma"} },
  { "0": {"id": 9, "name": "Fashion Nova"}, "1": {"id": 10, "name": "HM Fashion"} },
  { "0": {"id": 10, "name": "HM Fashion"}, "1": {"id": 9, "name": "Fashion Nova"} }
]

You can see that groups with ids 11,1,6 and 1,6,11 are both formed. Is there any way to get only distinct groups?

Kotlin not supported

Please provide Kotlin language support. In Kotlin, the result object is always empty.

Need a mechanism to match elements by a matching key rather than by element type

Let us suppose we have documents with the same element types, as follows

{
   name: NameElementType,
   spouseName: NameElementType
}

We need a mechanism to match these elements independently based on their key rather than their element type. In the above example, we want to match all documents based on name and spouseName independently.

Some issues with the matching

Hi Manish,

This is an extremely neat tool you have developed, kudos! I have been playing around with it and have run into a couple of issues. I was hoping you could help me resolve them.

I have been trying to match a single record against a database of records, and have been going up from 10,000 to 500,000 records. This is how I have been configuring the database:

new Document.Builder(csv[0])
.addElement(new Element.Builder().setType(NAME).setValue(getName(csv)).createElement())
.addElement(new Element.Builder().setType(ADDRESS).setValue(getAddress(csv)).createElement())
.addElement(new Element.Builder().setType(PHONE).setValue(csv[8]).setWeight(2)
.setThreshold(0.5).createElement())
.addElement(new Element.Builder().setType(EMAIL).setValue(csv[10]).createElement())
.setThreshold(0.5)
.createDocument();

But I am seeing some anomalies.

  1. I am always seeing only one record being returned, whereas I am expecting all records with a score greater than the 0.5 threshold. And I know that in each case there are multiple records that should pass the threshold. This is how I am printing the records:
    result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
    System.out.println("Person searched: " + match.getData() + "\nMatched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
    });

I don't have a unique identifier for each record. This is how my CSV looks:
"first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
Do you think that might be the issue?

  2. I am seeing something like this:
    Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
    Matched With: {[{'Susanna Desiga'}, {'4 W Broad St San Juan Capistrano Orange CA 92675'}, {'[email protected]'}, {'949-622-6261'}]} Score: 0.8668599263800767

Given that the only thing remotely in common is the first name, I am wondering why there is such a high matching score, whereas something like:
Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'[email protected]'}, {'732-478-7394'}]} Score: 0.7142857142857143
which is actually a better match, only gets a score of 0.7142.
Other "bad" match, but good score examples:
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Fedak'}, {'4983 Mcallister St Cambridge Middlesex MA 02138'}, {'[email protected]'}, {'617-357-4376'}]} Score: 0.7142857142857143
searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'[email protected]'}, {'7324787395'}]}
Matched With: {[{'Susanna Molavi'}, {'17389 Market St #8 Pearl City Honolulu HI 96782'}, {'[email protected]'}, {'808-723-3110'}]} Score: 0.7142857142857143

And only these "bad" matches were returned despite the "good" match being present in the database:
{[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'[email protected]'}, {'732-478-7394'}]} Score: 0.7142857142857143

Any suggestions on how to tune the match database to get better results would be greatly appreciated!

FYI, the data I am using here is all fake.

Thanks!
Vishal.

Why Does Matching Fail in These Scenarios?

Hello,

I tried to modify some names for matching [Ahmed, Mohamed, Jouliana] by making slight alterations and then attempted to find matches. I anticipated matches like (1,2), (3,4), (3,5), (4,5), and (6,7).

Here's the code snippet I used:

String[][] input = {
    {"1", "Ahmed"},
    {"2", "Ahme"},
    {"3", "Mohamed"},
    {"4", "Mohame"},
    {"5", "Mahamad"},
    {"6", "Jouliana"},
    {"7", "Jouli"}
};

List<Document> documentList = Arrays.asList(input).stream().map(contact -> {
    return new Document.Builder(contact[0])
            .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME).createElement())
            // .addElement(new Element.Builder<String>().setValue(contact[1]).setType(TEXT).createElement())
            .createDocument();
}).collect(Collectors.toList());

However, I only received the following result when the type is set to NAME:

Data: {[{'Mohamed'}]} Matched With: {[{'Mahamad'}]} Score: 1.0
Data: {[{'Mahamad'}]} Matched With: {[{'Mohamed'}]} Score: 1.0

And no results were obtained when the type is set to TEXT.

How to use getScore in the Element class? What is matchingCount?

Hello 👋,

I've had a look at the code snippet provided in the Element class:

public double getScore(Integer matchingCount, Element other) {
    return (double)matchingCount / (double)this.getChildCount(other);
}

public long getChildCount(Matchable other) {
    if (other instanceof Element) {
        Element<T> o = (Element)other;
        return (long)Math.max(this.getTokens().size(), o.getTokens().size());
    } else {
        return 0L;
    }
}

I'm doing matching between two Documents (rows) and aiming to obtain the score for each element in the match. I presume each document contains only one document match, the one with the highest matching score (row-wise).

I primarily encounter two issues:

  • What does matchingCount represent? I would appreciate any examples.
  • How can I map elements in document 1 to their corresponding elements in document 2? I notice that elements don't have a key attribute or something similar that could help in establishing the mapping.

Thank you.

Name List matcher

My use case is comparing individuals in an ancestry application. I need the ability to check a list of sibling names and children's names.

I am currently using the library to check names, gender, birthdates and such. These are all single values.

Now imagine I have an arbitrary number of siblings. I would like to call it a match if at least one sibling name matches between individuals A and B. I would also like to do the same with a list of child names.

I have not thought of a way to use the existing functionality to do this. Perhaps it can be done; I have only been using the library for a week. If so, please tell me how. If not, I am asking that the library be extended to support a list of names as a type.
I could imagine it being used like this:

.addElement(new Element.Builder<List>()
.setValue(individual.getSiblingNames())
.setVariance("Siblings")
.setType(ElementType.NAME_LIST)
.setWeight(0.3)
.createElement())

Maybe a setting for how many name matches are required for it to count as a fuzzy match?
I would be satisfied with just one, but it could be more generic.
