Fuzzy-Matcher

Introduction
How does this work
- Four Stages of Fuzzy Match
Building the Library
- Prerequisite
- Compiling and installing locally
Using the Library
End User Configuration
Demo

Introduction

A java based library to match and group "similar" elements in a collection of document.

In system with a collection of contacts. If we wanted to match and categorize contacts with similar names, location where they live or with any other attribute. This matching algorithm helps to achieve it using Fuzzy match. In fact, you may even try to find out if you have already added duplicate contacts, or you can use this to prevent your system from adding one.

This library can act on any domain object like contact and find similarity for various use cases. It dives deep into each character and finds out the probability that 2 or more of these objects are similar.

How does this work

Four Stages of Fuzzy Match

The input provided goes through 4 stages to come up with a match result. These 4 stages are functions defined which can be easily configured by passing a lambda expression.

Pre-Processing : Expects a Function<String, String> which a simple transformation from a String to another.
- Trim: Removes leading and trailing spaces (applied by default)
- Lower Case: Converts all characters to lower case (applied by default)
- Remove Special Chars : Removes all characters except alphabets and numeric chars and space. (default for TEXT type)
- Numeric: Strips all non-numeric characters. Useful for numeric values like phone or ssn (default for PHONE type)
- Email: Strips away domain from an email. This prevents common domains like gmail.com, yahoo.com to be considered in match (default for EMAIL type)
Tokenization : Expects a Function<Element, Stream<Token>> which will break down each Element into Token object used for comparison.
- Word : Breaks down an element into words (anything delimited by space " "). e.g. "123 some street" -> ["123" "some" "street"]
- N-Gram : Breaks down an element into 3 letter grams. e.g. "jparker" -> ["jpa","par","ark","rke","ker"]
Similarity Match : Expects a BiFunction<Token, Token, Double>, which gives a match probability (0.0 to 1.0) between 2 tokens
- Soundex: Compares 2 token strings soundex values. Uses apache commons codec library to get soundex values. The result of this match is a binary either 0.0 or 1.0
- Equality: Does a string equals of 2 token strings. Again the result is either 0.0 or 1.0
- Levenshtein: Gets the Levenshtein distance score using apache commons similarity library
- Jaccard: Gets the Jaccard score using apache commons similarity library
Scoring : Expects a Function<Match, Double>, this defines functions on how to accumulate scores from Tokens into Elements and from Elements into Documents
- Average: Adds up total scores of each child matches / total children. This is the default scoring for Elements
- Weighted Average: This is useful for Document Scoring, where users can input weights on elements. Example a phone number or email could be considered an important element to identify match between 2 User objects, and we can add weights to such elements.
- Exponential Average: Again useful for Document Scoring, where if more than 1 element match, we can increase the scoring exponentially
- Exponential Weighted Average: Uses both an exponents and weights for scoring. This is the default for Document Scoring

Building the Library

Prerequisite

You need Java SDK v1.8 or higher. Before you begin, you should check your current Java installation by using the following command: java -version

fuzzy-match is compatible with Apache Maven 4.0 or above. If you do not already have Maven installed, you can follow the instructions at maven.apache.org.

On many operating systems, Maven can be installed with a package manager.
If you use OSX Homebrew, try brew install maven.
Ubuntu users can run sudo apt-get install maven.
Windows users with Chocolatey can run choco install maven from an elevated (administrator) prompt.

Compiling and installing locally

After cloning the project locally. Run this command to compile, test and install the project

mvn clean install

Using the Library

Maven Import

The library is pusblished to maven central

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>0.1.0</version>
</dependency>

Input

This library take a collection of Document object with various Element as input.

For example, if you have a User object in your system, you can easily convert it to a Document with this simple builder pattern provided

new Document.Builder(User.getId())
    .addElement(new Element.Builder().setType(TEXT).setValue(User.getName()).createElement())
    .addElement(new Element.Builder().setType(ADDRESS).setValue(User.getAddress()).createElement())
    .addElement(new Element.Builder().setType(PHONE).setValue(User.getPhone()).createElement())
    .addElement(new Element.Builder().setType(EMAIL).setValue(User.getEmail()).createElement())
    .createDocument();

The Element Types like "TEXT", "ADDRESS", "PHONE", "EMAIL" are just simple ways to define custom stages (or functions) to be applied at different stages of match.

Below is the list of Element Types available in the library with default PreProcessing Function, Tokenizer Function and Scoring Function.

Element Type	PreProcessing Function	Tokenizer Function	Scoring Function
NAME	namePreprocessing()	wordTokenizer()	soundex()
TEXT	removeSpecialChars()	wordTokenizer()	soundex()
ADDRESS	addressPreprocessing()	wordTokenizer()	soundex()
EMAIL	removeDomain()	nGramTokenizer()	equality()
PHONE	numericValue()	valueTokenizer()	phoneNumber()

Applying the Match

The entry point for running this program is through MatchService class. Import Match service using spring in your class

@Autowired
MatchService matchService;

It support 3 ways to match the documents

Match a list of Documents : This is useful if you have an existing list of document, and want to find out which of them might have potential duplicates. A typical de-dup use case

matchService.applyMatch(List<Document> documents)

Match a list of Document with Existing List : This is useful for matching a new list of document with an existing list in your system. For example if your are performing a bulk import and want to find out if any of them match with existing data

matchService.applyMatch(List<Document> documents, List<Document> matchWith)

Match a Document with Existing List : This is useful when a new document is being created and you want to ensure that a similar document does not already exist in your system

matchService.applyMatch(Document document, List<Document> matchWith)

Output

The response of the library is essentially a Match<Document> object. It has 3 attributes

Data : This is the source Document on which the match is applied
MatchedWith : This is the target Document that the data matched with
Result : This is the probability score between 0.0 - 1.0 indicating how similar the 2 documents are

The response is grouped by the Data attribute, so from any of the MatchService methods the response is map

Map<Document, List<Match<Document>>>

End User Configuration

This library allows configuration at various level. These are all the configurable elements allowed

Configurable Functions

All the 4 Stages defined above are configurable java functions passed into the library. Although there is a rich set of predefined functions, they can easily be overridden or applied in addition of the exiting functions.

For example if an ElementType is defined as a TEXT the removeSpecialChars which is a simple transform function is applied. This is a Function<String, String> that converts a String to another String

If you want change this and say in "Name" field remove any suffix like "jr.", "sr.", etc

This can be over-ridden when creating an Element using the builder pattern

new Element.Builder().setType(TEXT).setValue(user.getName())
                            .setPreProcessingFunction(str -> str.replace("jr.", ""))
                            .createElement())

Using the function composition, you could also add more functionality to the existing function

new Element.Builder().setType(TEXT).setValue(user.getName())
                            .setPreProcessingFunction(removeSpecialChars().compose(str -> str.replace("jr.", ""))
                            .createElement())

This allows you to change the entire behavior of how matches are applied at all the 4 stages

Document Configuration

Key : Required field indicating unique primary key of the document
Elements : Set of elements for each document
Threshold : A double value between 0.0 - 1.0 above which the document be considered as match.
ScoringFunction : The ScoringFunction used to aggregate individual Element scores to a Document score

Element Configuration

Value : String representation of the value to match
Type : These are predefined elements, which applies relevant functions for "PreProcessing", "Tokenization" and "SimilarityMatch"
Threshold : A double value between 0.0 - 1.0 above which the element be considered as match.
Weight : A value applied to an element to increase or decrease the document score. The default is 1.0, any value above that will increase the document score if that element is matched.
PreProcessingFunction : The PreProcessingFunction function to be used
TokenizerFunction : The TokenizerFunction to be used
SimilarityMatchFunction : The SimilarityMatchFunction to be used
ScoringFunction : The ScoringFunction used to aggregate individual Token scores to an Element score

Demo

To run a simple match. Consider this example of names (in file src/test/resources/demo.csv)

Stephen Wilkson
John Pierre
Wilson, Stephen
Pierre john
Stephen Kilsman wilkson

Run this maven test for the above set of example

mvn -Dtest=com.intuit.fuzzymatcher.component.MatchServiceTest#itShouldApplyMatchForDemo test

You should see a response like this printed on your console

Data: {[{'Stephen Wilkson'}]} Matched With: {[{'Stephen Kilsman wilkson'}]} Score: 0.6666666666666666
Data: {[{'John Pierre'}]} Matched With: {[{'Pierre john'}]} Score: 1.0
Data: {[{'Pierre john'}]} Matched With: {[{'John Pierre'}]} Score: 1.0
Data: {[{'Stephen Kilsman wilkson'}]} Matched With: {[{'Stephen Wilkson'}]} Score: 0.6666666666666666

Change the input and see the results change .... Happy Matching!!!

datree-intuit-org / fuzzy-matcher Goto Github PK

fuzzy-matcher's Introduction