Stemming? (tidytext issue, closed)

juliasilge commented on August 24, 2024
Stemming?

from tidytext.

Comments (17)

dgrtwo commented on August 24, 2024

I think that SnowballC::wordStem already fits well with the tidy text philosophy, since it operates on a vector of tokens and returns a vector of the same length. E.g.:

library(dplyr)
library(tidytext)
library(janeaustenr)
library(SnowballC)

# tibble() replaces the deprecated data_frame()
tibble(txt = prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))

I could imagine adding a stem argument to unnest_tokens as to whether it should stem the tokens afterwards, but I think that's cramming too much in that function (we could also have an argument for removing stop words, but we choose to make that a separate step).

It's possible there are other stemming algorithms that return something other than a vector, in which case we could set up tidy functions for their output (or something similar!), but I don't think it's necessary for SnowballC::wordStem.
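The "separate steps" idea above can be sketched as follows. This is only an illustration: the two-sentence input and the column names are made up, and it assumes the stop_words data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical two-line input, used only for illustration
docs <- tibble(txt = c("She was loving the lovely walks",
                       "He loved walking and walked daily"))

tokens <- docs %>%
  unnest_tokens(word, txt)

stemmed <- tokens %>%
  anti_join(stop_words, by = "word") %>%   # stop-word removal as its own step
  mutate(stem = wordStem(word))            # vector in, vector of the same length out

stemmed
```

Because wordStem() returns one stem per input token, it can sit inside mutate() without any reshaping.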

juliasilge commented on August 24, 2024

I'll post an update here! There are several packages that implement stemming in R, including hunspell, SnowballC, and proustr. Since all of these can already be used with tidy data principles, we are not going to implement any additional stemming functions in tidytext itself, but I would like to include stemming in a vignette at some point, so I am keeping this issue open.

jhoertt commented on August 24, 2024

Have there been any updates on stemming in the tidytext package since this last post?

jeroen commented on August 24, 2024

The new version of hunspell works on all platforms and has zero dependencies (it bundles the libhunspell source code). The package includes a speller (hunspell_check), a stemmer (hunspell_stem), and a tokenizer (hunspell_parse) that can parse the text, man, latex, html, and xml formats.

jmcastagnetto commented on August 24, 2024

Just my very late 0.02 monetary units.

To my not so knowledgeable eyes, the hunspell results in Spanish make more sense than SnowballC. I've put an example at: https://gist.github.com/jmcastagnetto/3b0776f7558621e5d06a2a0981b20c2e

juliasilge commented on August 24, 2024

The stemming chapter of our new book covers this topic in detail.

juliasilge commented on August 24, 2024

OK, that does work nicely. (I should have looked into the function in more detail! I didn't see that it was just vectors i/o.) I lean against adding the stemming to unnest_tokens. From the initial reading I've done, it seems like the stemming algorithm in SnowballC (Porter's algorithm) is the standard/best/normal one to use, so there you go! ✔️

I think I will leave this issue open because it would be good to incorporate a stemming example into a vignette at some point.

dgrtwo commented on August 24, 2024

hunspell_parse looks very interesting! I've wanted a way to parse non-text formats like LaTeX or HTML. Until now my workaround (pasting together text, parsing with xml2, and using xml_text) has been quite a hassle.

@juliasilge, what do you think of adding a format argument to unnest_tokens? If format = "text" (the default), for now I'd rather keep using the tokenizers package, since that's what we've been using. But if the format is man/latex/html/xml, we can switch to the hunspell tokenizer (and in turn raise an error if the user tries to tokenize by anything but word). That's an easy dependency to take on.

I think the stemmer is worth mentioning in the stemming vignette, but we don't need to build it in, for the same reasons as above. (Incidentally, @jeroenooms, is there a reason hunspell_stem returns a list rather than a vector? Will items in the list ever be longer than 1?)

jeroen commented on August 24, 2024

"Will items in the list ever be longer than 1?"

Yes, if there is more than one match. Try this:

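library(hunspell)

words <- c("love", "loving", "lovingly", "loved", "lover", "lovely", "love")
hunspell_stem(words)     # list: one character vector of candidate stems per word
hunspell_analyze(words)  # list: morphological analysis for each word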

juliasilge commented on August 24, 2024

These results definitely don't look like Porter algorithm stemming. I just did a bit of googling and it looks like the hunspell stemming algorithm is a different approach entirely, a dictionary-based approach?

jeroen commented on August 24, 2024

Hunspell uses a more sophisticated method which generalizes to non-English languages. It is based on a special dictionary format that defines valid stem+suffix syntax in a given language. Most major operating systems already include such dictionaries in their language packs, so it's really powerful if you want your package to work for other languages as well :)
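A rough sketch of how the dictionary-based approach looks from R, assuming the en_US dictionary that is bundled with the hunspell package. Any other language code would need its dictionary installed on the system first.

```r
library(hunspell)

# Dictionaries hunspell can find on this machine
list_dictionaries()

# Load a dictionary explicitly; swap "en_US" for another installed language code
en <- dictionary("en_US")

words <- c("love", "loving", "loved", "lover")
stems <- hunspell_stem(words, dict = en)

# One list element per input word; each element holds zero or more stems
lengths(stems)
```

Which dictionaries list_dictionaries() reports depends on the machine, so results for other languages will vary between systems.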

dgrtwo commented on August 24, 2024

I implemented the format argument to unnest_tokens in c78fbc5 using hunspell (also tests and NEWS).

I really like how it can now tokenize HTML:

library(readr)
library(tidytext)

data_frame(text = read_lines("https://www.amazon.com")) %>%
  unnest_tokens(word, text, format = "html")
#> # A tibble: 326 x 1
#>           word
#>          <chr>
#> 1       amazon
#> 2          com
#> 3       online
#> 4     shopping
#> 5          for
#> 6  electronics
#> 7      apparel
#> 8    computers
#> 9        books
#> 10        dvds
#> # ... with 316 more rows

(This particular parsing is a bit slow, probably just because of the size of the page and the amount of JavaScript on it, but I can already think of work applications where this will be useful.)

sagitaninta commented on August 24, 2024

Is this change already merged into the latest version of tidytext? With the latest CRAN release, hunspell::hunspell_stem() cannot be used as a drop-in for SnowballC::wordStem(), because the former returns a list whereas the latter returns a vector of the same length as its input.

juliasilge commented on August 24, 2024

Yep, the change for using hunspell for tokenizing HTML has been in the CRAN version of tidytext since late 2016. You can see that announcement, with an example, on my blog here.

If you'd like to use the hunspell stemmer, you do need to handle the output differently than the Snowball stemmer, because it is a different approach.

For example, if you want to stem using Snowball, you'd do that like this:

library(dplyr)
library(tidytext)
library(SnowballC)

tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word_stem = wordStem(word))
#> # A tibble: 122,204 x 2
#>    word      word_stem
#>    <chr>     <chr>    
#>  1 pride     pride    
#>  2 and       and      
#>  3 prejudice prejudic 
#>  4 by        by       
#>  5 jane      jane     
#>  6 austen    austen   
#>  7 chapter   chapter  
#>  8 1         1        
#>  9 it        it       
#> 10 is        i        
#> # … with 122,194 more rows

Created on 2019-05-04 by the reprex package (v0.2.1)

If you would like to use the hunspell stemmer, you would need to do this instead:

library(tidyverse)
library(tidytext)
library(hunspell)

tibble(txt = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, txt) %>%
  mutate(word_stem = hunspell_stem(word)) %>%
  unnest(word_stem)
#> # A tibble: 121,057 x 2
#>    word      word_stem
#>    <chr>     <chr>    
#>  1 pride     pride    
#>  2 and       and      
#>  3 prejudice prejudice
#>  4 by        by       
#>  5 chapter   chapter  
#>  6 1         1        
#>  7 it        it       
#>  8 is        i        
#>  9 a         a        
#> 10 truth     truth    
#> # … with 121,047 more rows

Created on 2019-05-04 by the reprex package (v0.2.1)

Since the hunspell stemmer can return more than one stem per word (and, as the missing rows for "jane" and "austen" above suggest, sometimes none), we need to unnest() to get each stem on its own row. This is one benefit of using tidy data; now we can analyze all these different stems alongside the words they came from.
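If exactly one stem per word is wanted, so that the result behaves like wordStem() inside mutate(), one option is a small wrapper. The helper below is hypothetical (it is not part of hunspell or tidytext), and picking the last candidate is an arbitrary assumption rather than a recommendation.

```r
library(hunspell)

# Hypothetical helper: return exactly one stem per word so the result can be
# used inside mutate() like SnowballC::wordStem()
stem_one <- function(words) {
  stems <- hunspell_stem(words)
  vapply(seq_along(words), function(i) {
    candidates <- stems[[i]]
    # Fall back to the original word when the dictionary offers no stem;
    # otherwise take the last candidate (an arbitrary choice)
    if (length(candidates) == 0) words[[i]] else candidates[[length(candidates)]]
  }, character(1))
}

one_per_word <- stem_one(c("pride", "prejudice", "austen", "loving"))
one_per_word
```

This keeps the row count unchanged, at the cost of discarding the alternative stems that unnest() would have preserved.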

Mihiretukebede commented on August 24, 2024

Bought your nice book! I used tidytext and it seems it didn't do any word stemming: "obese" and "obesity" were counted separately (see image). Should I use SnowballC, as suggested by dgrtwo?

image

juliasilge commented on August 24, 2024

You can stem text in a tidy data workflow if you like, as shown here. There are several options, like SnowballC for stemming or spaCy for lemmatization.
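For the "obese" versus "obesity" case, a minimal sketch with SnowballC might look like the following. The three-line corpus is made up for illustration, and the Porter stemmer is expected to map both words to the same stem, so they are counted together.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical mini-corpus, for illustration only
notes <- tibble(txt = c("obese patients", "obesity risk", "obesity and diabetes"))

stem_counts <- notes %>%
  unnest_tokens(word, txt) %>%
  mutate(stem = wordStem(word)) %>%   # stem each token before counting
  count(stem, sort = TRUE)

stem_counts
```

Counting on the stem column rather than the word column is what merges the variants.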

github-actions commented on August 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
