Do we want to implement a tidy interface to a stemming algorithm? I am aware of the on

Stemming? about tidytext HOT 17 CLOSED

juliasilge commented on August 24, 2024 4

Stemming?

from tidytext.

Comments (17)

dgrtwo commented on August 24, 2024 9

I think that SnowballC::wordStem already fits well with the tidy text philosophy, since it operates on a vector of tokens and returns a vector of the same length. E.g.:

library(dplyr)
library(tidytext)
library(janeaustenr)
library(SnowballC)

# One token (word) per row, then stem each token in place
data_frame(txt = prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))

I could imagine adding a stem argument to unnest_tokens to control whether it stems the tokens afterwards, but I think that's cramming too much into that function (we could also have an argument for removing stop words, but we chose to make that a separate step).

It's possible there are other stemming algorithms that return something other than a vector, in which case we could set up tidy functions for their output (or something similar!), but I don't think it's necessary for SnowballC::wordStem.

juliasilge commented on August 24, 2024 9

I'll update this here! There are several packages that implement stemming in R, including hunspell, SnowballC, and proustr. Since all of these can already be used with tidy data principles, we are not going to implement any stemming functions in tidytext itself, but I would like to include stemming in a vignette at some point, so I am keeping this issue open.

jhoertt commented on August 24, 2024 8

Have there been any updates on stemming in the tidytext package since this last post?

jeroen commented on August 24, 2024 1

The new version of hunspell works on all platforms and has zero dependencies (it bundles the libhunspell source code). The package includes a speller (hunspell_check), a stemmer (hunspell_stem), and a tokenizer (hunspell_parse) that can parse the text, man, latex, html, and xml formats.

jmcastagnetto commented on August 24, 2024 1

Just my very late 0.02 monetary units.

To my not so knowledgeable eyes, the hunspell results in Spanish make more sense than SnowballC. I've put an example at: https://gist.github.com/jmcastagnetto/3b0776f7558621e5d06a2a0981b20c2e

juliasilge commented on August 24, 2024 1

The stemming chapter of our new book covers this topic in detail.

juliasilge commented on August 24, 2024

OK, that does work nicely. (I should have looked into the function in more detail! I didn't see that it was just vectors i/o.) I lean against adding the stemming to unnest_tokens. From the initial reading I've done, it seems like the stemming algorithm in SnowballC (Porter's algorithm) is the standard/best/normal one to use, so there you go! ✔️

I think I will leave this issue open because it would be good to incorporate a stemming example into a vignette at some point.

dgrtwo commented on August 24, 2024

hunspell_parse looks very interesting! I've wanted a way to parse non-text formats like LaTeX or HTML. Until now my workaround (pasting together text, parsing with xml2, and using xml_text) has been quite a hassle.

@juliasilge, what do you think of adding a format argument to unnest_tokens? If format = "text" (the default), for now I'd rather keep using the tokenizers package since that's what we've been using. But if the format is man/latex/html/xml, we can switch to the hunspell tokenizer (and in turn raise an error if the user tries to tokenize by anything but word). That's an easy dependency to take on.

I think the stemmer is worth mentioning in a stemming vignette, but we don't need to build it in, for the same reasons as above. (Incidentally, @jeroenooms, is there a reason hunspell_stem returns a list rather than a vector? Will items in the list ever be longer than 1?)

jeroen commented on August 24, 2024

"Will items in the list ever be longer than 1?"

Yes, if there is more than one match. Try this:

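library(hunspell)

words <- c("love", "loving", "lovingly", "loved", "lover", "lovely", "love")
hunspell_stem(words)     # list with one character vector of stems per word
hunspell_analyze(words)  # list with the morphological analysis for each word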
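
For anyone following along, here is a minimal sketch of one way to inspect that list output. It assumes the hunspell package and its default English dictionary are installed. The stems themselves depend on the dictionary, so no output is shown.

```r
library(hunspell)

words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
stems <- hunspell_stem(words)

# The result is a list with one element per input word
length(stems) == length(words)

# Number of candidate stems found for each word (may be 0, 1, or more)
stem_counts <- setNames(lengths(stems), words)
stem_counts
```

Any count other than 1 is a case where the output cannot be dropped straight into a column the way a plain character vector can.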

juliasilge commented on August 24, 2024

These results definitely don't look like Porter algorithm stemming. I just did a bit of googling and it looks like the hunspell stemming algorithm is a different approach entirely, a dictionary-based approach?

jeroen commented on August 24, 2024

Hunspell uses a more sophisticated method which generalizes to non-English languages. It is based on a special dictionary format that defines valid stem+suffix syntax in a given language. Most major operating systems already include such dictionaries in their language packs, so it's really powerful if you want your package to work for foreign languages as well :)
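
To make the stem+suffix idea concrete, here is a toy sketch in base R. It is not hunspell's actual algorithm or dictionary format: the `toy_stems` and `toy_suffixes` vectors and the `toy_stem()` helper are made up for illustration, and real affix rules can also strip or change characters, which this sketch does not attempt.

```r
# Toy "dictionary": known stems plus the suffixes allowed after them.
# These entries are illustrative and not taken from any real hunspell dictionary.
toy_stems    <- c("love", "walk")
toy_suffixes <- c("", "s", "d", "ed", "r", "ly")

# Hypothetical helper: return every stem for which stem + an allowed suffix
# reproduces the word exactly. Unknown words give character(0).
toy_stem <- function(word) {
  hits <- vapply(
    toy_stems,
    function(s) any(paste0(s, toy_suffixes) == word),
    logical(1)
  )
  unname(toy_stems[hits])
}

lapply(c("loved", "walks", "xyzzy"), toy_stem)
```

Because the lookup is driven entirely by the dictionary, a word that is not in it gets no stem at all, whereas a rule-based stemmer such as Porter's rewrites any string it is given.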

dgrtwo commented on August 24, 2024

I implemented the format argument to unnest_tokens in c78fbc5 using hunspell (also tests and NEWS).

I really like how it can now tokenize HTML:

library(readr)
library(tidytext)

data_frame(text = read_lines("https://www.amazon.com")) %>%
  unnest_tokens(word, text, format = "html")
#> # A tibble: 326 x 1
#>           word
#>          <chr>
#> 1       amazon
#> 2          com
#> 3       online
#> 4     shopping
#> 5          for
#> 6  electronics
#> 7      apparel
#> 8    computers
#> 9        books
#> 10        dvds
#> # ... with 316 more rows

(This particular parsing is a bit slow, probably just because of the size of the page and the amount of JavaScript on it, but I can already think of work applications where this will be useful.)

sagitaninta commented on August 24, 2024

Is this change already merged into the latest version of tidytext? With the latest version of tidytext on CRAN, hunspell::hunspell_stem() cannot be used like SnowballC::wordStem(), because the former returns a list whereas the latter returns a vector of the same length as its input.

juliasilge commented on August 24, 2024

Yep, the change for using hunspell for tokenizing HTML has been in the CRAN version of tidytext since late 2016. You can see that announcement, with an example, on my blog here.

If you'd like to use the hunspell stemmer, you do need to handle the output differently than the Snowball stemmer, because it is a different approach.

For example, if you want to stem using Snowball, you'd do that like this:

library(dplyr)
library(tidytext)
library(SnowballC)

tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word_stem = wordStem(word))
#> # A tibble: 122,204 x 2
#>    word      word_stem
#>    <chr>     <chr>    
#>  1 pride     pride    
#>  2 and       and      
#>  3 prejudice prejudic 
#>  4 by        by       
#>  5 jane      jane     
#>  6 austen    austen   
#>  7 chapter   chapter  
#>  8 1         1        
#>  9 it        it       
#> 10 is        i        
#> # … with 122,194 more rows

Created on 2019-05-04 by the reprex package (v0.2.1)

If you would like to use the hunspell stemmer, you would need to do this instead:

library(tidyverse)
library(tidytext)
library(hunspell)

tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word_stem = hunspell_stem(word)) %>%
  unnest(word_stem)
#> # A tibble: 121,057 x 2
#>    word      word_stem
#>    <chr>     <chr>    
#>  1 pride     pride    
#>  2 and       and      
#>  3 prejudice prejudice
#>  4 by        by       
#>  5 chapter   chapter  
#>  6 1         1        
#>  7 it        it       
#>  8 is        i        
#>  9 a         a        
#> 10 truth     truth    
#> # … with 121,047 more rows

Created on 2019-05-04 by the reprex package (v0.2.1)

Since the hunspell stemmer outputs possibly more than one stem per word, we need to unnest() to get the stems on their own rows. This is one benefit of using tidy data; now we can analyze all these different stems, along with their root words.
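
A small sketch of what unnest() does with such a list-column, using a hand-made tibble rather than real hunspell output (the `toy` data and the stems in it are made up for illustration). It assumes tibble and tidyr 1.0 or later are installed.

```r
library(tibble)
library(tidyr)

# Hypothetical stemmer output: zero, one, or two stems per word
toy <- tibble(
  word      = c("jane", "truth", "lovely"),
  word_stem = list(character(0), "truth", c("lovely", "love"))
)

# One row per stem; by default, words with no stems drop out
long <- unnest(toy, word_stem)

# keep_empty = TRUE keeps those words, with NA as the stem
long_all <- unnest(toy, word_stem, keep_empty = TRUE)

# Alternative: stay at one row per word by keeping only the last stem
# (an arbitrary choice made here for illustration)
toy$one_stem <- vapply(
  toy$word_stem,
  function(s) if (length(s) > 0) s[length(s)] else NA_character_,
  character(1)
)
```

The default dropping of words with no stems is one possible reason the hunspell result above has fewer rows than the Snowball result and lacks words such as jane and austen, though that depends on the dictionary in use.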

Mihiretukebede commented on August 24, 2024

Bought your nice book! I used tidytext and it seems it didn't do any word stemming: "obese" and "obesity" were counted as different words (see image). Should I use SnowballC, as suggested by dgrtwo?

juliasilge commented on August 24, 2024

You can stem text in a tidy data workflow if you like, as shown here. There are several options, like SnowballC for stemming or spaCy for lemmatization.

github-actions commented on August 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
