An unsupervised text summarization and information retrieval library that uses natural language processing models under the hood.

Home Page: https://pypi.org/project/pynutshell/

License: MIT License

Python 100.00%
summarization nlp text information-retrieval information-extraction keyword-extraction ranking-algorithms similarity-algorithms tokenizer text-summarization

Nutshell

A simple-to-use yet robust Python library containing tools for:

  • Text summarization
  • Information retrieval
  • Finding similarities
  • Sentence ranking
  • Keyword extraction
  • and many more in progress...

Getting Started

These instructions will get you a copy of the project that is ready to use in your Python projects.

Installation

Quick Access

  • Install from PyPI:

    pip install pynutshell

Developer Style

  • Requires Python version >=3.6

  • Clone this repository using the command:

    git clone https://github.com/KrishnanSG/Nutshell.git
    cd Nutshell

  • Then install the library using the command:

    python setup.py install

Note: The package is distributed on PyPI as pynutshell because the name nutshell was unavailable. The package you import is still named nutshell, so please do not confuse the two.

How does the library work?

The library has several components:

  1. Summarizers
  2. Rankers
  3. Similarity Algorithms
  4. Information Retrievers
  5. Keyword Extractors

Summarization

Summarization is a technique for transforming or condensing textual information using natural language processing.

Types of summarization

Extractive

This technique is very similar to highlighting important sentences while reading a book.

The algorithm finds the important sentences in the corpus (the NLP term for the raw input text) by reducing redundancy: when several sentences are very similar to each other, it retains one of them and removes the rest.

Though this method is powerful, it cannot combine two or more sentences into a single complex sentence, so it does not produce an optimal result in some cases.
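The redundancy-removal idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names (`split_sentences`, `bag_of_words`, `cosine`, `drop_redundant`), the word-count cosine similarity, and the `threshold` value are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Lower-cased word counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant(text, threshold=0.6):
    """Keep a sentence only if it is not too similar to one already kept.

    The threshold is an assumed value chosen for this example.
    """
    kept = []
    for sentence in split_sentences(text):
        vec = bag_of_words(sentence)
        if all(cosine(vec, bag_of_words(k)) < threshold for k in kept):
            kept.append(sentence)
    return kept


text = (
    "The cat sat on the mat. "
    "The cat sat on the mat today. "
    "Stock prices fell sharply on Monday."
)
summary = drop_redundant(text)
print(summary)
```

In this toy input the second sentence is nearly identical to the first, so only the first and third sentences survive.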

Abstractive

Unlike the extractive technique, this technique is much more complex and more robust in producing summaries. The algorithm used for it performs sentence clustering using semantic analysis (finding the meaning of a sentence).

Sentence Ranking

Text rankers are algorithms similar to the ranking algorithms used to rank web pages. A ranker estimates the importance of each sentence in the document and assigns it a rank, which tells us how important that sentence is.
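To make the web-page analogy concrete, here is a minimal sketch of a PageRank-style power iteration over a sentence similarity matrix. It is an assumption-laden illustration, not the code behind the library's rankers: the function name `rank_sentences`, the `damping` and `iterations` defaults, and the toy similarity matrix are all hypothetical.

```python
def rank_sentences(similarity, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration.

    similarity[i][j] is a non-negative similarity between sentences i and j.
    Returns one score per sentence; a higher score means a more central sentence.
    """
    n = len(similarity)
    if n == 0:
        return []
    # Ignore self-similarity so a sentence cannot vote for itself.
    weights = [
        [0.0 if i == j else float(similarity[i][j]) for j in range(n)]
        for i in range(n)
    ]
    # Total outgoing weight of each sentence, used to normalise its votes.
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Toy similarity matrix for three sentences (made-up values).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
scores = rank_sentences(sim)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order)
```

Sentence 1 is strongly connected to both of the others, so it ends up ranked first, while the weakly connected sentence 2 ends up last.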

Similarity Algorithms

Text similarity algorithms measure the similarity between two documents (or sentences).

A few classic algorithms for finding similarity are:

  1. Cosine Similarity
  2. Euclidean Distance

Note: word2vec is an important transformation step that converts words into vectors so that mathematical operations can be performed on them easily.
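The two classic measures listed above can be illustrated with a short sketch. As a simplifying assumption, sentences are turned into plain word-count vectors here rather than word2vec embeddings, and the helper names (`to_vector`, `cosine_similarity`, `euclidean_distance`) are hypothetical and not part of the library.

```python
import math
from collections import Counter


def to_vector(sentence, vocabulary):
    """Word-count vector for a sentence over a fixed vocabulary (stand-in for word2vec)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction, 0.0 means no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v: 0.0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


first = "the cat sat on the mat"
second = "the cat lay on the rug"
vocabulary = sorted(set(first.split()) | set(second.split()))
u = to_vector(first, vocabulary)
v = to_vector(second, vocabulary)

similarity = cosine_similarity(u, v)
distance = euclidean_distance(u, v)
print(similarity, distance)
```

Note the opposite directions of the two measures: a higher cosine similarity means the sentences are more alike, whereas a higher Euclidean distance means they are less alike. For these two sentences the similarity comes out at about 0.75 and the distance at 2.0.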

Features

Checklist of features the library currently offers and plans to offer.

  • Keyword Extraction
  • Text Tokenizers
  • Text cleaners
  • Semantic decoder
  • Summarization
    • Extractive
    • Abstractive
  • Text Rankers
    • Intermediate
    • Advanced
  • Information Retrieval
    • Intermediate
    • Advanced

Examples

Summarization

A simple example of how to use the library to perform extractive text summarization on a given input text (corpus).

from nutshell.algorithms.information_retrieval import ClassicalIR
from nutshell.algorithms.ranking import TextRank
from nutshell.algorithms.similarity import BM25Plus
from nutshell.model import Summarizer
from nutshell.preprocessing.cleaner import NLTKCleaner
from nutshell.preprocessing.preprocessor import TextPreProcessor
from nutshell.preprocessing.tokenizer import NLTKTokenizer
from nutshell.utils import load_corpus, construct_sentences_from_ranking

# Example
corpus = load_corpus('input.txt')
print("\n --- Original Text ---\n")
print(corpus)

preprocessor = TextPreProcessor(NLTKTokenizer(), NLTKCleaner())
similarity_algorithm = BM25Plus()
ranker = TextRank()
ir = ClassicalIR()

# Text Summarization
model = Summarizer(preprocessor, similarity_algorithm, ranker, ir)
summarised_content = model.summarise(corpus, reduction_ratio=0.70, preserve_order=True)

print("\n --- Summarized Text ---\n")
print(construct_sentences_from_ranking(summarised_content))

Keyword Extraction

A simple example of how to use the library to perform keyword extraction on a given input text (corpus).

from nutshell.algorithms.information_retrieval import ClassicalIR
from nutshell.model import KeywordExtractor
from nutshell.preprocessing.cleaner import NLTKCleaner
from nutshell.preprocessing.preprocessor import TextPreProcessor
from nutshell.preprocessing.tokenizer import NLTKTokenizer
from nutshell.utils import load_corpus

corpus = load_corpus('input.txt')
print("\n --- Original Text ---\n")
print(corpus)

# Text Keyword Extraction
preprocessor = TextPreProcessor(NLTKTokenizer(), NLTKCleaner(skip_stemming=True))
keyword_extractor = KeywordExtractor(preprocessor, ClassicalIR())
keywords = keyword_extractor.extract_keywords(corpus, count=10, raw=False)


print("\n --- Keywords ---\n")
print(keywords)

Contribution

Contributions are always welcome. It would be great to have people use and contribute to this project so that users can understand and benefit from the library.

How to contribute

  • Create an issue: If you have a new feature in mind, feel free to open an issue with a short description of what that feature could be.
  • Create a PR: If you have a bug fix, enhancement, or new feature, create a pull request and the maintainers of the repo will review and merge it.

Authors

Contributors

krishnansg, shruthi-22

nutshell's Issues

references to the algorithms?

Hi,

Thanks a lot for making this tool!
I wonder which extractive summarization algorithm you use. Could you please point me to a paper or website?

Thanks!
