davidbelicza / textrank

:wink: :cyclone: :strawberry: TextRank implementation in Golang with extendable features (summarization, phrase extraction) and multithreading (goroutine).

License: MIT License

Go 99.61% Shell 0.09% Dockerfile 0.31%

go golang textrank summarization token sentence-classification phrase-extraction pagerank

textrank's Issues

Cannot use latest version with go.mod

I am unable to use the latest version of the library, v2.1.2, with Go modules.
I get the error: require github.com/DavidBelicza/TextRank: version "v2.1.2" invalid: should be v0 or v1, not v2
To use v2.1.2 in go.mod, the library needs to be available at github.com/DavidBelicza/TextRank/v2, according to https://github.com/golang/go/wiki/Modules#semantic-import-versioning
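
The rule behind that error can be sketched in a few lines. This is a simplified illustration, not the go command's actual check: the helper names `majorOf`, `majorSuffix` and `pathMatchesVersion` are made up for this example, and special cases such as gopkg.in paths and `+incompatible` versions are ignored.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf returns the major number of a version tag such as "v2.1.2".
func majorOf(version string) (int, error) {
	v := strings.TrimPrefix(version, "v")
	return strconv.Atoi(strings.SplitN(v, ".", 2)[0])
}

// majorSuffix returns N if modulePath ends in "/vN" with N >= 2, else 0.
func majorSuffix(modulePath string) int {
	i := strings.LastIndex(modulePath, "/v")
	if i < 0 {
		return 0
	}
	n, err := strconv.Atoi(modulePath[i+2:])
	if err != nil || n < 2 {
		return 0
	}
	return n
}

// pathMatchesVersion reports whether modulePath carries the suffix that
// semantic import versioning expects for version: no suffix for v0 and v1,
// "/vN" for major version N >= 2.
func pathMatchesVersion(modulePath, version string) bool {
	major, err := majorOf(version)
	if err != nil {
		return false
	}
	if major < 2 {
		return majorSuffix(modulePath) == 0
	}
	return majorSuffix(modulePath) == major
}

func main() {
	fmt.Println(pathMatchesVersion("github.com/DavidBelicza/TextRank", "v2.1.2"))
	fmt.Println(pathMatchesVersion("github.com/DavidBelicza/TextRank/v2", "v2.1.2"))
}
```

Under this rule, the first path is rejected for v2.1.2 and the second is accepted, which matches the error reported above.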

TextToRank should accept interface instead of ParsedSentence as struct

If I want to use convert.TextToRank with a customized rule, I cannot use the convert package, because it accepts the parse.ParsedSentence struct as its first argument.

I had a quick look, and it seems it should be possible to change parse.ParsedSentence to an interface.

I can create a PR if this makes sense to you, @DavidBelicza!
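
A rough sketch of what the proposal could look like follows. Everything in it is hypothetical: the `Sentence` interface, its method names, and `countWordPairs` (a stand-in for a function like convert.TextToRank) are assumptions made for illustration, not the library's actual API.

```go
package main

import "fmt"

// Sentence is a hypothetical interface that a function such as
// convert.TextToRank could accept instead of a concrete struct.
type Sentence interface {
	GetID() int
	GetWords() []string
}

// customSentence is a caller-defined type produced by custom parsing logic.
type customSentence struct {
	id    int
	words []string
}

func (s customSentence) GetID() int         { return s.id }
func (s customSentence) GetWords() []string { return s.words }

// countWordPairs is NOT the library function; it only shows that code
// written against the interface works with any implementation.
// It counts adjacent word pairs of the sentence into pairs.
func countWordPairs(s Sentence, pairs map[[2]string]int) {
	words := s.GetWords()
	for i := 0; i+1 < len(words); i++ {
		pairs[[2]string{words[i], words[i+1]}]++
	}
}

func main() {
	pairs := map[[2]string]int{}
	countWordPairs(customSentence{id: 0, words: []string{"gnome", "shell", "extension"}}, pairs)
	fmt.Println(len(pairs), pairs[[2]string{"gnome", "shell"}])
}
```

With an interface in the signature, callers can supply sentences produced by their own parser without copying them into the package's struct.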

Is it possible to have phrase with 3 or 4 words too?

Thanks for this library, it's very useful.

Is it possible to have a "phrase" with 3 or 4 words too?

Textrank API modification to use on web

The provider variable in textrank.go shares its value between requests when the library runs inside a Go web server.

The solution is to:

remove provider,
create a TextRank struct in textrank.go,
create a constructor,
store rank object inside TextRank struct,
transform GetRank, Append and Ranking functions to methods,
replace rankId parameters with rank,
update go docs,
update readme,
test coverage should be 100%, go-report should be A+
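
The steps above could be sketched roughly as follows. The `Rank` type here is a simplified stand-in that only counts words, and the method bodies are assumptions made for illustration; only the names TextRank, GetRank and Append come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// Rank is a stand-in for the library's rank object; it only counts words.
type Rank struct {
	words map[string]int
}

// TextRank owns its own Rank, so no state is shared through a
// package-level provider variable.
type TextRank struct {
	rank *Rank
}

// NewTextRank is the constructor; every call returns independent state.
func NewTextRank() *TextRank {
	return &TextRank{rank: &Rank{words: map[string]int{}}}
}

// Append adds text to this instance's rank only.
func (t *TextRank) Append(text string) {
	for _, w := range strings.Fields(text) {
		t.rank.words[strings.ToLower(w)]++
	}
}

// GetRank returns the rank object stored in this instance.
func (t *TextRank) GetRank() *Rank {
	return t.rank
}

func main() {
	a, b := NewTextRank(), NewTextRank()
	a.Append("gnome shell gnome")
	fmt.Println(a.GetRank().words["gnome"], len(b.GetRank().words))
}
```

In a web server, each request handler would create its own instance through the constructor, so two concurrent requests never write to the same rank.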

Problem ranking text containing abbreviation, such as U.S.A

Hi, first of all thanks for this library, you are awesome 🚀

I'm having an issue ranking text that contains abbreviations such as U.S.A (short for United States of America) or No. 7 (short for Number 7), because the . character is currently used (see https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21) to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the Rule interface that checks for known abbreviations?
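
One possible workaround, independent of the Rule interface, is to strip the dots from known abbreviations before the text is handed to the library. The sketch below assumes a caller-supplied abbreviation list; `protectAbbreviations` is a hypothetical helper, not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// protectAbbreviations removes the dots from every known abbreviation so
// that a dot-based separator rule no longer splits them.
// List longer abbreviations first when one is a prefix of another.
func protectAbbreviations(text string, abbreviations []string) string {
	for _, abbr := range abbreviations {
		text = strings.ReplaceAll(text, abbr, strings.ReplaceAll(abbr, ".", ""))
	}
	return text
}

func main() {
	known := []string{"U.S.A", "No."}
	fmt.Println(protectAbbreviations("Made in the U.S.A by No. 7 team.", known))
}
```

The trade-off is that the ranked output contains the rewritten forms (USA, No) instead of the original spelling, so a reverse mapping may be needed when displaying results.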

Add tokenization of words

Adding a tokenization library or making this an optional input could really help.

For instance, what if I wanted to search only top sentences related to "food" for recipes or "locations" for scanning destinations in blog posts?

I realize I can do this using the chain phrase myself but basic NLP entity extraction would be nice to have so we can group by broader categories.

Also, the term "chain phrase" implies that the words will be found in order, like in a "chain". It would be trivial and somewhat useful to add a "preserve order" option so we can do a proximity search within phrases.
E.g. "captain james kirk" should preserve order so we can find "captain, james t. kirk", but we don't need "kirk captain james". Just a suggestion.
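
A minimal sketch of such a "preserve order" check, done outside the library, could look like this. The function names are hypothetical, and the matching is a plain in-order subsequence test without any distance limit between the words.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokens lowercases s and splits it on anything that is not a letter or digit.
func tokens(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// containsInOrder reports whether all words of phrase occur in text in the
// same relative order; other words may appear in between.
func containsInOrder(text, phrase string) bool {
	want := tokens(phrase)
	i := 0
	for _, w := range tokens(text) {
		if i < len(want) && w == want[i] {
			i++
		}
	}
	return i == len(want)
}

func main() {
	fmt.Println(containsInOrder("captain, james t. kirk", "captain james kirk"))
	fmt.Println(containsInOrder("kirk captain james", "captain james kirk"))
}
```

The first call matches because "t" is simply skipped; the second does not, because "kirk" appears before "captain" and never again after "james".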

Otherwise the library works really well, it's clean and straightforward, thanks!

TokenizeText func doesn't return all sentences

I expect text.parsedSentences to contain all sentences. Let me explain the problem with code :)

Place this in the parse/tokenizer_test.go file:

func TestTokenizeText(t *testing.T) {
	rule := NewRule()

	text := TokenizeText("Hi!!!", rule)
	assert.Equal(t, "Hi!", text.parsedSentences[0].original)
	assert.Equal(t, "!", text.parsedSentences[1].original)
	assert.Equal(t, "!", text.parsedSentences[2].original)
}

I expect this test to pass, but it does not.
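
For reference, the behavior the test expects can be sketched with a standalone splitter that closes a sentence at every terminator. This is not the library's TokenizeText; `splitSentences` and its choice of terminators are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences closes a sentence at every '.', '!' or '?', so repeated
// terminators each produce their own one-character sentence.
func splitSentences(text string) []string {
	var out []string
	var cur strings.Builder
	for _, r := range text {
		cur.WriteRune(r)
		if r == '.' || r == '!' || r == '?' {
			out = append(out, cur.String())
			cur.Reset()
		}
	}
	// Keep any trailing text that has no terminator.
	if cur.Len() > 0 {
		out = append(out, cur.String())
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitSentences("Hi!!!"))
}
```

Whether sentences that consist only of punctuation should be kept or dropped is a design decision; the sketch keeps them because that is what the test above asserts.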

More accurate ranking algorithm

In case 1, the icons - tray and extension - gnome phrases both got a weight of 0.5, but extension - gnome is clearly a more important phrase than icons - tray. The two phrases occur equally often, but the word gnome itself has more hits than the words icons or tray. Following this logic, the weight of extension - gnome should be > 0.5 and < 1.

But this logic should not cause the side effect seen in case 2, where every phrase that contains the word gnome becomes important.

Case 1 and case 2 are correct, so they shouldn't be modified, but a new algorithm is required that implements the above logic. It should be a new, third Algorithm interface implementation: SupervisedAlgorithm or ComparatorAlgorithm.

Case 1, FindPhrases method result from ranked text by AlgorithmDefault

Phrase: gnome - shell, Occurrence: 5, Weight: 1
Phrase: icons - tray, Occurrence: 3, Weight: 0.5
Phrase: extension - gnome, Occurrence: 3, Weight: 0.5
Phrase: dock - dash, Occurrence: 2, Weight: 0.25

Case 2, FindPhrases method result from ranked text by AlgorithmMixed

Phrase: gnome - shell, Occurrence: 5, Weight: 1
Phrase: gnome - caffeine, Occurrence: 2, Weight: 0.8
Phrase: gnome - takes, Occurrence: 1, Weight: 0.73333335
Phrase: gnome - commonly, Occurrence: 1, Weight: 0.73333335
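
The Case 1 weights are consistent with a min-max normalization of the occurrence counts, assuming the smallest occurrence in the ranked text is 1 and the largest is 5. That assumption is inferred from the numbers above, not confirmed from the library source, and `normalize` is a hypothetical helper. It also shows why two phrases with equal occurrence necessarily tie under this scheme.

```go
package main

import "fmt"

// normalize maps an occurrence count into [0, 1] by min-max scaling.
// When min == max every count is treated as maximal (a choice made for
// this sketch to avoid dividing by zero).
func normalize(occurrence, min, max int) float32 {
	if max == min {
		return 1
	}
	return float32(occurrence-min) / float32(max-min)
}

func main() {
	// Occurrences from Case 1; min = 1 is an assumption.
	for _, occ := range []int{5, 3, 3, 2} {
		fmt.Println(occ, normalize(occ, 1, 5))
	}
}
```

Because the weight depends only on the phrase occurrence, icons - tray and extension - gnome both land on 0.5; breaking that tie requires an extra input such as the individual word counts, which is what the proposed third algorithm would add.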
