amansrivastava17 / bns-short-text-similarity

📖 Use Bi-Normal Separation to find document vectors, which are used to compute similarity for shorter sentences.

License: MIT License

Python 100.00%
bns text-vectorization nlp cosine-similarity text-similarity text-classification bns-vectorizer tf-idf term-frequency short-text-semantic-similarity


BNS Vectorizer - Improved TF-IDF for shorter text

Bi-Normal Separation (BNS) is a popular method for scoring the importance of textual features with respect to their category. It can efficiently find important keywords in a document and assign them a positive weight, while assigning a negative score to words that are unimportant for that category.

The variables used to calculate the Bi-Normal Separation score of a word for each category (or class) are described in the formula section below.

Why is BNS better than TF-IDF?

Because the documents are short, TF-IDF and other term-frequency-based approaches do not perform well: usually no word occurs more than once in a document, so we need an approach that does not rely on term frequency within the document.

BNS overcomes this problem by assigning a weight to each term based on its occurrence in the positive and negative categories (or classes). A term that occurs often in positive samples and seldom in negative ones gets a high BNS weight.

Also, whereas IDF gives a term a single value across all categories, BNS assigns a term a different weight in each category.

Formula to calculate BNS:

  • pos = number of positive training cases, typically minority,

  • neg = number of negative training cases,

  • tp = number of positive training cases containing word,

  • fp = number of negative training cases containing word,

  • fn = pos - tp,

  • tn = neg - fp,

  • tpr (true positive rate) = P(word | positive class) = tp / pos,

  • fpr (false positive rate) = P(word | negative class) = fp/neg,

  • bns (Bi-Normal Separation) = F^(-1)(tpr) - F^(-1)(fpr)

    F^(-1) is the inverse Normal cumulative distribution function
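
For concreteness, here is a minimal sketch of the score above for a single word in one category. This is an illustration rather than code from this repository; clamping tpr and fpr to [0.0005, 0.9995] is an assumption made here to keep the inverse normal CDF finite.

from scipy.stats import norm

def bns_score(tp, fp, pos, neg):
    """Bi-Normal Separation = F^(-1)(tpr) - F^(-1)(fpr)."""
    tpr = min(max(tp / pos, 0.0005), 0.9995)  # P(word | positive class), clamped
    fpr = min(max(fp / neg, 0.0005), 0.9995)  # P(word | negative class), clamped
    return norm.ppf(tpr) - norm.ppf(fpr)      # norm.ppf is the inverse normal CDF

# Example: a word present in 3 of 4 positive documents and 1 of 6 negative ones
print(bns_score(tp=3, fp=1, pos=4, neg=6))    # large positive BNS weight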

Usage:

Create BNS Vectorizer
from bns import BNS
documents = ['please book flights to mumbai', 'need a flight to goa',
             'airline price for 2 adult', 'plan a trip to goa',
             'book a taxi for me', 'book ola for home',
             'show uber around me', 'nearby gym around me', 'nearby by temple',
             'i want to know nearby cinema hall in mumbai']

categories = ['book_flight', 'book_flight', 'book_flight', 'book_flight',
              'book_taxi', 'book_taxi', 'book_taxi',
              'nearby', 'nearby', 'nearby']

BNS_VECTORIZER = BNS()
BNS_VECTORIZER.fit(documents, categories)
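
After fitting, the vectorizer keeps per-category sentence vectors and the training sentences themselves (these attributes are used again in the similarity example below). A quick peek at one category, assuming the attributes behave as they are used later:

print(len(BNS_VECTORIZER.vectors['book_flight']))            # one vector per training sentence in the category
print(BNS_VECTORIZER.sentences_category_map['book_flight'])  # the corresponding training sentences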
Calculate Cosine similarity
from operator import itemgetter
from sklearn.metrics.pairwise import cosine_similarity

test_documents = ['book me a flight please']
test_bns_vectors = BNS_VECTORIZER.transform(test_documents)

# Let's find the most similar sentence and category for the given test document
results = []
for category in test_bns_vectors.keys():
    vector = test_bns_vectors[category]
    category_trained_sentence_vectors = BNS_VECTORIZER.vectors[category]
    category_trained_sentence = BNS_VECTORIZER.sentences_category_map[category]
    cosine_scores = cosine_similarity(vector, category_trained_sentence_vectors)[0]
    for score, sent in zip(cosine_scores, category_trained_sentence):
        results.append({'match_sentence':sent, 'category': category, 'score':score})

results = sorted(results, key=itemgetter('score'), reverse=True)
for each in results:
    print(each)
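
Since results is sorted by score, the top entry gives the predicted category and the closest training sentence; a small follow-up using the list built above:

best = results[0]
print('predicted category :', best['category'])
print('closest sentence   :', best['match_sentence'])
print('cosine score       :', best['score'])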

The similarity method above might not produce good results because no preprocessing is involved; you can refer to my previous repository for the various text preprocessing steps to apply before sending documents to the BNS vectorizer (a minimal illustrative sketch follows the link below).

link : text preprocessing python
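
As a minimal illustrative sketch (not the code from the linked repository; the normalization steps here are assumptions), simple preprocessing before fitting could look like this:

import re

def preprocess(text):
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # replace punctuation and symbols with spaces
    return ' '.join(text.split())             # collapse repeated whitespace

clean_documents = [preprocess(doc) for doc in documents]
BNS_VECTORIZER.fit(clean_documents, categories)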

There is still a lot of room for improvement when computing similarity for shorter sentences. Do try the methods above and let me know if you have any improvements or suggestions.

Thanks !!
