shashank-mugiwara / dedupeknn

Fast Scalable Dedupe - Fuzzy Matching With Opensearch + nmslib + Rapidfuzz

Home Page: https://github.com/shashank-mugiwara/dedupeknn

License: MIT License

Language: Python 100.00%

Topics: address-matching, dedupe, fuzzymatching, nmslib, opensearch, rapidfuzz

dedupeknn

dedupeknn is a project for finding duplicated addresses and performing address matching efficiently. It uses FastText to generate vector representations of addresses and OpenSearch as the vector store, and it relies on nearest neighbor algorithms from NMSLIB for accurate and fast address comparisons.

dedupeknn uses the FastText library to generate vector representations of text inputs. Transforming address strings into vector embeddings lets dedupeknn capture the semantic and contextual information needed for accurate address comparisons.

OpenSearch serves as the vector store for dedupeknn. OpenSearch is an open-source search and analytics engine, started by AWS as a fork of Elasticsearch, that provides efficient storage and retrieval for large-scale vector datasets. With OpenSearch, dedupeknn can scale to large volumes of address data.

To find the nearest neighbors of a given address vector, dedupeknn uses nearest neighbor algorithms from NMSLIB. These algorithms search the vector store for the most similar addresses, which enables deduplication and address matching.

By combining FastText, OpenSearch, and NMSLIB, dedupeknn provides a fast and accurate way to detect duplicated addresses and match addresses, which helps organizations improve data quality.
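To make the nearest neighbor idea concrete, here is a minimal sketch of cosine-similarity search over a handful of vectors. It is a brute-force scan in plain Python, not the approximate HNSW search that NMSLIB performs, and the document ids and 3-dimensional vectors are hypothetical stand-ins for real FastText embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, indexed, k=3):
    """Return the k (doc_id, score) pairs most similar to `query`."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; the vectors used by dedupeknn have 300 dimensions.
indexed = {
    "addr-1": [0.9, 0.1, 0.0],
    "addr-2": [0.0, 1.0, 0.0],
    "addr-3": [0.8, 0.2, 0.1],
}
top = nearest_neighbours([1.0, 0.1, 0.0], indexed, k=2)
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

A full scan like this costs time proportional to the number of stored vectors for every query, which is why an approximate index is preferable at scale.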

Running dedupeknn

  1. The project uses the FastAPI library and runs as a microservice. It requires a running OpenSearch cluster with the opensearch-knn plugin installed.
  2. The configuration is loaded from the properties file properties/opensearch-client.properties. Set the values to match your installation.
  3. Create a new conda environment and activate it: conda create -n dedupeknn python=3.10, then conda activate dedupeknn
  4. Install the required dependencies: pip install -r requirements.txt
  5. Run the project: python main.py
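As an illustration of step 2, the sketch below parses a simple key=value properties file into a dictionary. The key names (`opensearch.host`, `opensearch.port`, `opensearch.index`) are hypothetical; check properties/opensearch-client.properties in the repository for the names the project actually reads. The project may also load the file differently.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


# Hypothetical keys and values, for illustration only.
sample = """
# OpenSearch connection
opensearch.host=localhost
opensearch.port=9200
opensearch.index=dedupe-index
"""
config = parse_properties(sample)
print(config)
```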

Creating KNN index before ingesting data

The example below shows how to create an OpenSearch index with k-NN support.

{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "dedupe_vector_nmslib": {
        "type": "knn_vector",
        "dimension": 300,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

Note:

  1. We use cosinesimil (cosine similarity) as the k-NN space type.
  2. The k-NN algorithm implementation comes from nmslib (Non-Metric Space Library).
  3. The FastText model that we use to create vector representations of the input data has 300 dimensions, so we set the field's dimension value to 300. If you use a model with a different dimensionality, such as 500 or 800, change this field accordingly.
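If you switch embedding models, it may be convenient to generate the index body instead of editing the JSON by hand. The helper below is a hypothetical sketch (the function name and its parameters are not part of the project); it reproduces the mapping shown above with a configurable dimension and field name.

```python
import json


def build_knn_index_body(dimension=300, field="dedupe_vector_nmslib"):
    """Build the index settings and mapping for a cosine-similarity HNSW field."""
    return {
        "settings": {
            "index": {"knn": True, "knn.algo_param.ef_search": 100}
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,  # must match the embedding model
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                        "parameters": {"ef_construction": 128, "m": 24},
                    },
                }
            }
        },
    }


body = build_knn_index_body(dimension=300)
payload = json.dumps(body, indent=2)
print(payload)
```

The resulting JSON can be sent as the body of a PUT request to the index URL of your OpenSearch cluster to create the index.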

APIs exposed

Ingesting data:

curl --location 'http://localhost:8080/api/v1/knn/doc/insert' \
--header 'Content-Type: application/json' \
--data '{
    "text": "#6/A Shashank J, 3rd Floor, Chetan Nilaya, 20 C Cross Rd, Ejipura, Bengaluru - 560047"
}'
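The same request can be built from Python with only the standard library. This is a sketch: the helper name `build_json_post` is hypothetical, the host and port are copied from the curl example, and the response format is not documented here, so the snippet only prepares the request and leaves sending it as a commented-out step.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the curl example


def build_json_post(path, payload):
    """Prepare a JSON POST request for one of the service endpoints."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_post(
    "/api/v1/knn/doc/insert",
    {"text": "#6/A Shashank J, 3rd Floor, Chetan Nilaya, 20 C Cross Rd, "
             "Ejipura, Bengaluru - 560047"},
)

# To send it, the service must be running locally:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The other endpoints in this section accept JSON bodies in the same way, so the helper can be reused by changing the path and payload.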

Getting vector representation of a string

curl --location 'http://localhost:8080/api/v1/vector/representation' \
--header 'Content-Type: application/json' \
--data-raw '{
    "text": "*@) sdfd *29&3 -2030"
}'

Getting K-Nearest-Neighbours for the input string

curl --location 'http://localhost:8080/api/v1/similarity/knn/search' \
--header 'Content-Type: application/json' \
--data '{
    "text": "Chetan Nilaya, House No 6, 3rd Floor, Ejipur, Bangalore 560047",
    "size": 30,
    "k": 1
}'

Note:

  1. size - the number of neighbours to return.
  2. k - the level of neighbours; this appears to correspond to the k parameter of the OpenSearch k-NN query.

Similarity Match

curl --location 'http://localhost:8080/api/v1/similarity/address/search' \
--header 'Content-Type: application/json' \
--data '{
    "text": "#6/A Third Floor, ChetanNilaya, 20C Road Ejipura,  bengaluru karnataka 560047",
    "size": 30,
    "k": 1,
    "threshold": 70
}'
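The README does not explain the `threshold` field. Given that the project description mentions Rapidfuzz, one plausible reading is that the k-NN candidates are re-scored with a fuzzy string score on a 0 to 100 scale and candidates below the threshold are dropped. The sketch below illustrates that idea under this assumption. It uses `difflib` from the standard library as a stand-in for Rapidfuzz, so the scores will not match what the service computes, and the function names and the second candidate address are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_by_threshold(query, candidates, threshold=70):
    """Keep candidates scoring at least `threshold`, best match first."""
    scored = [(text, fuzzy_score(query, text)) for text in candidates]
    kept = [(text, score) for text, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


query = ("#6/A Third Floor, ChetanNilaya, 20C Road Ejipura,  "
         "bengaluru karnataka 560047")
candidates = [
    "#6/A Shashank J, 3rd Floor, Chetan Nilaya, 20 C Cross Rd, "
    "Ejipura, Bengaluru - 560047",
    "42 Lake View Apartments, Powai, Mumbai 400076",  # hypothetical address
]
matches = filter_by_threshold(query, candidates, threshold=70)
for text, score in matches:
    print(round(score, 1), text)
```

Splitting the work this way lets the vector search narrow a large index down to a few candidates cheaply, while the string comparison, which is more expensive per pair, runs only on that short list.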
