davidbelicza / textrank

:wink: :cyclone: :strawberry: TextRank implementation in Golang with extendable features (summarization, phrase extraction) and multithreading (goroutine).

License: MIT License

Go 99.61% Shell 0.09% Dockerfile 0.31%
go golang textrank summarization token sentence-classification phrase-extraction pagerank

textrank's Issues

Textrank API modification to use on web

The provider variable in textrank.go keeps its value across requests when the library runs inside a long-running Go web server, so its state is shared between requests.

The proposed solution is to:

  • remove provider,
  • create a TextRank struct in textrank.go,
  • create a constructor,
  • store rank object inside TextRank struct,
  • transform GetRank, Append and Ranking functions to methods,
  • replace rankId parameters with rank,
  • update go docs,
  • update readme,
  • test coverage should be 100%, go-report should be A+
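The refactoring steps above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the `Rank` type here is a hypothetical stand-in that only counts words, and the method bodies are assumptions made to keep the example self-contained. The point is only that each `TextRank` value owns its own state, so concurrent requests do not share anything.

```go
package main

import (
	"fmt"
	"strings"
)

// Rank is a hypothetical stand-in for the library's rank object.
// Here it only counts how often each word was seen.
type Rank struct {
	words map[string]int
}

// TextRank owns its Rank, so no state lives at package level.
type TextRank struct {
	rank *Rank
}

// NewTextRank is the constructor; every call returns an isolated instance.
func NewTextRank() *TextRank {
	return &TextRank{rank: &Rank{words: make(map[string]int)}}
}

// Append adds text to this instance only (naive whitespace split for brevity).
func (t *TextRank) Append(text string) {
	for _, w := range strings.Fields(text) {
		t.rank.words[strings.ToLower(w)]++
	}
}

// GetRank returns the instance's own rank object instead of looking one up by ID.
func (t *TextRank) GetRank() *Rank {
	return t.rank
}

func main() {
	a, b := NewTextRank(), NewTextRank()
	a.Append("gnome shell gnome")
	b.Append("dock")
	// Each instance sees only its own words.
	fmt.Println(len(a.GetRank().words), len(b.GetRank().words)) // prints "2 1"
}
```

In a web handler, one such value would be created per request (or per document), which removes the need for a shared provider.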

Problem ranking text containing abbreviation, such as U.S.A

Hi, first of all, thanks for this library, you are awesome 🚀

I'm having an issue ranking text that contains abbreviations such as U.S.A (short for United States of America) or No. 7 (short for Number 7), because the . character is currently used here https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21 to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the Rule interface that checks for known abbreviations?

Add tokenization of words

Adding a tokenization library or making this an optional input could really help.

For instance, what if I wanted to search only top sentences related to "food" for recipes or "locations" for scanning destinations in blog posts?

I realize I can do this myself using the chain phrase, but basic NLP entity extraction would be nice to have so we can group by broader categories.

Also, the term "chain phrase" implies that the words will be found in order, like in a "chain". It would be trivial and somewhat useful to add a "preserve order" option so we can do a proximity search within phrases.
For example, with order preserved, "captain james kirk" should match "captain, james t. kirk" but not "kirk captain james". Just a suggestion.

Otherwise the library works really well, it's clean and straightforward, thanks!
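The "preserve order" suggestion above could be sketched as a simple ordered-subsequence check. The function names `tokenize` and `containsInOrder` are hypothetical and not part of the library. A distance limit between matched words, which a real proximity search would need, is left out to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lower-cases text and splits it on anything that is not a
// letter or a digit, so punctuation is dropped.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// containsInOrder reports whether every word of phrase occurs in text
// in the same order, allowing other words in between.
func containsInOrder(text, phrase string) bool {
	want := tokenize(phrase)
	next := 0
	for _, tok := range tokenize(text) {
		if next < len(want) && tok == want[next] {
			next++
		}
	}
	return next == len(want)
}

func main() {
	fmt.Println(
		containsInOrder("captain, james t. kirk", "captain james kirk"),
		containsInOrder("kirk captain james", "captain james kirk"),
	) // prints "true false"
}
```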

TokenizeText func doesn't return all sentences

I expect text.parsedSentences to contain all sentences. Let me explain the problem with code :)

Place it in the parse/tokenizer_test.go file:

func TestTokenizeText(t *testing.T) {
	rule := NewRule()

	text := TokenizeText("Hi!!!", rule)
	assert.Equal(t, "Hi!", text.parsedSentences[0].original)
	assert.Equal(t, "!", text.parsedSentences[1].original)
	assert.Equal(t, "!", text.parsedSentences[2].original)
}

I expect this test to pass, but it does not!

More accurate ranking algorithm

In case 1, the icons - tray and extension - gnome phrases both got a weight of 0.5, but extension - gnome is clearly a more important phrase than icons - tray. The two phrases occur equally often, but the word gnome itself has more hits than the words icons or tray. Following this logic, the weight of extension - gnome should be > 0.5 and < 1.

But this logic should not cause the side effect seen in case 2, where every phrase that contains the word gnome becomes important.

Case 1 and case 2 are correct, so they shouldn't be modified, but a new algorithm is required that implements the above logic. It should be a new, third implementation of the Algorithm interface: SupervisedAlgorithm or ComparatorAlgorithm.

Case 1, FindPhrases method result from ranked text by AlgorithmDefault

  • Phrase: gnome - shell, Occurrence: 5, Weight: 1
  • Phrase: icons - tray, Occurrence: 3, Weight: 0.5
  • Phrase: extension - gnome, Occurrence: 3, Weight: 0.5
  • Phrase: dock - dash, Occurrence: 2, Weight: 0.25

Case 2, FindPhrases method result from ranked text by AlgorithmMixed

  • Phrase: gnome - shell, Occurrence: 5, Weight: 1
  • Phrase: gnome - caffeine, Occurrence: 2, Weight: 0.8
  • Phrase: gnome - takes, Occurrence: 1, Weight: 0.73333335
  • Phrase: gnome - commonly, Occurrence: 1, Weight: 0.73333335
