Text Tidying tools for R? (tidytext issue, closed)

juliasilge commented on July 22, 2024
Text Tidying tools for R?

from tidytext.

Comments (9)

juliasilge commented on July 22, 2024

Hello there, Peter! Cleaning text is definitely a challenge. In the book, we went the direction of including information on cleaning text within the case studies, not in a separate chapter. There are notes on how we cleaned text in the chapter on NASA metadata (which included HTML tags), the Usenet chapter (where we cleaned headers and such), and the Twitter chapter (where we talked about using regex to extract words and Twitter hashtags/usernames/etc).

There are certainly a lot of strategies to use! They are so specific to each dataset that we chose to show how various strategies can be applied in context. These are mostly of the stringr/regex variety. Thanks for sharing your input.
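As a rough illustration of the regex style of cleaning described above, here is a minimal sketch in base R. It is not code from the book: the sample strings and the helper names `strip_html()` and `extract_tagged()` are made up for this example, and real datasets usually need more careful patterns. The stringr functions `str_replace_all()`, `str_squish()`, and `str_extract_all()` cover the same ground.

```r
# Hypothetical sample text; not taken from the book's datasets.
raw <- c(
  "<p>Tidy text mining with <b>R</b></p>",
  "Loving #rstats and #tidytext, thanks @juliasilge!"
)

# Drop HTML tags by replacing anything between < and > with a space,
# then collapse runs of whitespace and trim the ends.
strip_html <- function(x) {
  no_tags <- gsub("<[^>]+>", " ", x)
  trimws(gsub("\\s+", " ", no_tags))
}

# Pull out every token that starts with the given prefix (# or @).
# Returns a list with one character vector per input string.
extract_tagged <- function(x, prefix) {
  pattern <- paste0(prefix, "[A-Za-z0-9_]+")
  regmatches(x, gregexpr(pattern, x))
}

clean <- strip_html(raw)
hashtags <- extract_tagged(raw, "#")
mentions <- extract_tagged(raw, "@")
```

A regex like `<[^>]+>` is only a quick approximation for stripping markup; for heavily nested or malformed HTML, a proper parser is the safer choice.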

pgensler commented on July 22, 2024

Julia,

Thanks for your comments. Is there any way to contribute an example to the book? I'm not sure if you have published the book or plan on it. The only reason I ask is that I've encountered so many issues with this single project that it may be worthwhile to include as a case study or vignette if you are willing, as I think it would help others with text mining. Let me know your thoughts on this.

pgensler commented on July 22, 2024

Hi Julia,

Not to belabor this issue, but would you be willing to accept PRs for additional vignettes for your package? As I'm going through the NASA chapter in your book, I think that this information is definitely quite valuable, but I'm wondering if it may be helpful to have a one-page summary of resources to turn to for basic text tasks, similar to what you have outlined in your book?
Possibly:

  • Dealing with encoding in R, as this can definitely be a pain point (esp. UTF-8 strings)
  • JSON parsing, clearly lots of resources
  • Web scraping, as more than likely you will need to do it to acquire text from web pages

It's not meant to be a comprehensive tutorial, but I'm curious to see if you think this fits with the spirit of tidy-text. It may very well not be meant for your package, which is fine with me. Let me know, and I'd be willing to help put together a minimal vignette.

This also relates to the janeaustenr package used in tidytext:
https://cran.r-project.org/web/packages/utf8/vignettes/utf8.html

juliasilge commented on July 22, 2024

Peter, thanks so much for your thoughtful comment, and I apologize for being slow to get back to you on this. I think this set of skills is SO IMPORTANT and I see a lot of value in compiling a good list of resources for text cleaning tasks. I literally was just responding to GitHub issues and blog comments from people with encoding issues for non-English languages. 😂 And knowing what to do with JSON or how to scrape the web is also super important.

Vignettes are typically about the code in a package, though, and how to use those functions. I've never seen a vignette that consists mainly of a list of resources external to that package; it is just not the right venue. If you put together this kind of guide and publish it, I would love to link to it from my blog, tweet about it, etc, but I don't think it's right for a vignette. Please let me know if you put together a valuable resource like this and publish it!

pgensler commented on July 22, 2024

Thanks so much for your reply! That's totally understandable. I wish I had used Twitter more to get help on these issues. Yeah, I'm a bit torn as to where to house these resources, as I seem to run into these issues more and more, especially with R.

I can understand why it would not make sense to have it located in a vignette. My only gripe is that I want a new R user to get off on the right foot, and not merely drown in not knowing what to do (like with Docker, shameless plug for my blog post: https://medium.com/@peterjgensler/creating-sandbox-environments-for-r-with-docker-def54e3491a3) (to be fair, I saw your Dockerfile, and I restrained myself 😭)

Do you think it's appropriate to have a portion of the README at the bottom dedicated to simply referring people to resources that may assist them in the data munging process? I know it's not entirely in line with the package's intent, but it's there if someone needs it?

I'm currently in the process of creating a blog post using this data as an example, but I feel like I've run into so many issues that I'm not even sure how to "set someone up for success", given all the troubles I've run into just trying to get it into a workable format. I wonder if most people (like myself 😃) enter into text mining assuming that you can just slam a largish file (like the Yelp dataset challenge) into this, and it should work just fine (which is definitely not the case 😂). Maybe this is just me, but I've had a myriad of nightmares attempting to clean this data:
https://community.rstudio.com/t/split-uneven-length-vectors-to-columns-with-tidyr/2704/10
https://stackoverflow.com/questions/43218761/by-row-vs-rowwise-iteration/43236272?noredirect=1#comment75570823_43236272

To give you a bit of context, merely tokenizing the raw data took over an hour with a different program (called Alteryx), and the output went from a 1GB file to 17GB (for one file, the other is here)

juliasilge commented on July 22, 2024

A link in the README is a good idea; if you put together a list of resources, let me know and I can add it.

(And I will have to check out your Docker post; I am excited to learn more about good practices for Docker.)

pgensler commented on July 22, 2024

Sounds good, I'll do a roundup, and see what I can come up with. As a new developer, Docker is so worthwhile but the pain......

pgensler commented on July 22, 2024

FYI @juliasilge I intended to write about functions, but a good portion covered encoding issues in R:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77

github-actions commented on July 22, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
