neurotech-hq / pysimilar

A Python library for computing the similarity between two strings (text) based on cosine similarity

Home Page: https://kalebu.github.io/pysimilar/

License: MIT License

pysimilar's Introduction

A Python library for computing the similarity between two strings (text) based on cosine similarity, made by kalebu

How does it work?

It uses a TF-IDF vectorizer (Tfidf Vectorizer) to transform each text into a vector, converts the resulting vectors into arrays of numbers, and then computes the cosine similarity between them. The output is a score indicating how similar the two texts are.
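The two steps described above can be sketched with the standard library alone. This is a minimal illustration, not pysimilar's own code: the function names (tokenize, tfidf_vectors, cosine_similarity, similarity) are hypothetical, the tokenizer is a simplification, and the smoothed IDF formula is an assumption chosen to mirror scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into runs of word characters (a simplification).
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text shares nothing with any other text.
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b):
    # Fit the TF-IDF weights on just the two texts being compared.
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine_similarity(vec_a, vec_b)


print(similarity("very light indeed", "how fast is light"))
```

Texts with no words in common score 0.0 and identical texts score approximately 1.0. For the two strings above, this sketch gives roughly 0.17.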

Installation

You can install it either directly from GitHub or with pip. Here is how to install it directly from GitHub:

$ git clone https://github.com/Kalebu/pysimilar
$ cd pysimilar
$ python setup.py install

Installation with pip

$ pip install pysimilar

Example of usage

Pysimilar lets you either pass the strings you want to compare directly or pass the paths to files containing the text you want to compare.

Here is an example of how to compare strings directly:

>>> from pysimilar import compare
>>> compare('very light indeed', 'how fast is light')
0.17077611319011649

Here is how to compare files containing text documents:

>>> compare('README.md', 'LICENSE', isfile=True)
0.25545580376557886

You can also compare all the documents with a particular extension in a given directory. For instance, suppose you want to compare all the .txt documents in a documents directory.

The directory used by the examples below looks like this:

documents/
├── anomalie.zeta
├── hello.txt
├── hi.txt
└── welcome.txt

Here is how to compare files with a particular extension:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = '.txt'
>>> comparison_result = pysimilar.compare_documents('documents')
>>> pprint(comparison_result)
[['welcome.txt vs hi.txt', 0.6053485081062917],
 ['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0]]

You can also sort the comparison results by score by setting the ascending parameter, as shown below:

>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['welcome.txt vs hi.txt', 0.6053485081062917]]

You can also set pysimilar to include files with multiple extensions:

>>> import pysimilar
>>> from pprint import pprint
>>> pysimilar.extensions = ['.txt', '.zeta']
>>> comparison_result = pysimilar.compare_documents('documents', ascending=True)
>>> pprint(comparison_result)
[['welcome.txt vs hello.txt', 0.0],
 ['hi.txt vs hello.txt', 0.0],
 ['anomalie.zeta vs hi.txt', 0.4968161174826459],
 ['welcome.txt vs hi.txt', 0.6292275146695526],
 ['welcome.txt vs anomalie.zeta', 0.7895651507603823]]
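The directory comparison shown above can be pictured as scoring every pair of matching files and sorting the results. The sketch below is an illustration under assumptions, not pysimilar's implementation: compare_directory and jaccard are hypothetical names, the files are assumed to be UTF-8, and a simple word-overlap scorer stands in for the TF-IDF cosine score.

```python
from itertools import combinations
from pathlib import Path


def compare_directory(directory, extensions, score, ascending=False):
    """Score every pair of files in `directory` whose suffix is in `extensions`."""
    if isinstance(extensions, str):
        extensions = [extensions]  # Accept a single extension such as '.txt'.
    files = sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in extensions
    )
    # Assumption: every matching file is UTF-8 encoded text.
    texts = {path.name: path.read_text(encoding="utf-8") for path in files}
    results = [
        [f"{name_a} vs {name_b}", score(texts[name_a], texts[name_b])]
        for name_a, name_b in combinations(texts, 2)
    ]
    # Highest score first unless ascending=True is requested.
    return sorted(results, key=lambda pair: pair[1], reverse=not ascending)


def jaccard(text_a, text_b):
    """Stand-in scorer: shared words divided by all distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```

Calling compare_directory('documents', ['.txt', '.zeta'], jaccard, ascending=True) would return a list of [label, score] pairs ordered from least to most similar. The scores themselves will differ from the ones above because the scorer is different.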

Contributions

If you have anything valuable to add to the library, whether it's documentation, a typo fix, or source code, please don't hesitate to contribute. Just fork the repo and submit your pull request, and I will try to be as friendly as I can in helping you make your contribution.

Give it a star

Did you find this repo useful? Then give it a star so that more people can become aware of it and use it. Share that love.

All the Credits

All the credits go to kalebu and other future contributors.

pysimilar's Issues

Can't seem to run this code

Traceback (most recent call last):
  File "C:\Users\lsa\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.10\cosinefin.py", line 4, in <module>
    comparison_result = pysimilar.compare_documents('C:/Users/lsa/desktop/sap')
  File "C:\Users\lsa\AppData\Local\Programs\Python\Python310\lib\site-packages\pysimilar\__init__.py", line 117, in compare_documents
    loaded_documents: Dict = self.load_files(path_to_documents)
  File "C:\Users\lsa\AppData\Local\Programs\Python\Python310\lib\site-packages\pysimilar\__init__.py", line 71, in load_files
    load_documents: List[str] = [self.load_file(path_to_document)
  File "C:\Users\lsa\AppData\Local\Programs\Python\Python310\lib\site-packages\pysimilar\__init__.py", line 71, in <listcomp>
    load_documents: List[str] = [self.load_file(path_to_document)
  File "C:\Users\lsa\AppData\Local\Programs\Python\Python310\lib\site-packages\pysimilar\__init__.py", line 66, in load_file
    content = document.read()
  File "C:\Users\lsa\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 241: character maps to <undefined>
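
The traceback shows the file being decoded with cp1252 and failing on byte 0x9d. One possible workaround, sketched here as an assumption rather than a fix in pysimilar itself, is to read the files yourself with an explicit encoding and pass the resulting strings to compare. The helper name read_text_utf8 is hypothetical, and the sketch assumes the documents are UTF-8.

```python
from pathlib import Path


def read_text_utf8(path):
    """Read a text file as UTF-8 instead of the platform default encoding.

    Assumes the file really is UTF-8; undecodable bytes are replaced with
    U+FFFD rather than raising UnicodeDecodeError.
    """
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Byte 0x9d is the last byte of the UTF-8 encoding of a right curly quote.
# cp1252 has no character at 0x9d, so decoding these bytes as cp1252 raises
# the same kind of error as the one reported above.
curly_quote_bytes = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'
try:
    curly_quote_bytes.decode("cp1252")
except UnicodeDecodeError as error:
    print("cp1252 fails:", error.reason)
print("utf-8 works:", curly_quote_bytes.decode("utf-8") == "\u201d")
```

With pysimilar installed, the two strings returned by read_text_utf8 could then be passed directly to compare, which accepts strings as well as file paths.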
