Have you thought about putting together a list of commonly done operations to clean te


Text Tidying tools for R? about tidytext HOT 9 CLOSED

juliasilge commented on July 22, 2024

Text Tidying tools for R?

from tidytext.

Comments (9)

juliasilge commented on July 22, 2024

Hello there, Peter! Cleaning text is definitely a challenge. In the book, we went the direction of including information on cleaning text within the case studies, not in a separate chapter. There are notes on how we cleaned text in the chapter on NASA metadata (which included HTML tags), the Usenet chapter (where we cleaned headers and such), and the Twitter chapter (where we talked about using regex to extract words and Twitter hashtags/usernames/etc).

There are certainly a lot of strategies to use! They are so specific to each dataset that we chose to show how various strategies can be applied in context. These are mostly of the stringr/regex variety. Thanks for sharing your input.
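
As a rough illustration of the stringr/regex style of cleaning mentioned above, here is a minimal sketch in base R. The sample strings, the `@example_user` and `@another_user` handles, and the patterns are all assumptions made up for this example; they are not taken from the book or from any dataset in this thread. The tag-stripping pattern is deliberately crude, and a proper HTML parser is likely safer on messy input.

```r
# Hypothetical sample text, invented for illustration only.
raw <- c(
  "<p>Loving the new <b>#rstats</b> release, thanks @example_user!</p>",
  "No markup here, just #textmining with @another_user"
)

# Remove anything that looks like an HTML tag: "<", then non-">" characters, then ">".
no_tags <- gsub("<[^>]+>", "", raw)

# Extract hashtags and usernames; each result is a list with one
# character vector per input string.
hashtags <- regmatches(no_tags, gregexpr("#[[:alnum:]_]+", no_tags))
mentions <- regmatches(no_tags, gregexpr("@[[:alnum:]_]+", no_tags))

print(no_tags)
print(hashtags)
print(mentions)
```

With stringr loaded, `str_remove_all()` and `str_extract_all()` cover the same two steps; base R is used here only to keep the sketch free of dependencies.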


pgensler commented on July 22, 2024

Julia,

Thanks for your comments. Is there any way to contribute an example to the book? I'm not sure if you have published the book or plan on it. The only reason I ask is that I've encountered so many issues with this single project that it may be worthwhile to include as a case study or vignette if you are willing, as I think it would help others with text mining. Let me know your thoughts on this.


pgensler commented on July 22, 2024

Hi Julia,

Not to belabor this issue, but would you be willing to accept PRs for additional vignettes for your package? As I'm going through the NASA chapter in your book, I think that this information is definitely quite valuable, but I'm wondering if it may be helpful to have a one-page summary of resources to turn to for basic text tasks, similar to what you have outlined in your book?
Possibly:

Dealing with encoding in R, as this can definitely be a pain point (esp. UTF-8 strings)
JSON parsing, clearly lots of resources
Web scraping, as more than likely you will need to do it to acquire text from web pages

It's not meant to be a comprehensive tutorial, but I'm curious to see if you think this fits with the spirit of tidytext. It may very well not be meant for your package, which is fine with me. Let me know, and I'd be willing to help put together a minimal vignette.

This also relates to the janeaustenr package used in tidytext:
https://cran.r-project.org/web/packages/utf8/vignettes/utf8.html
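
To make the encoding pain point concrete, here is a small base R sketch of one common case: text whose bytes are Latin-1 but which needs to be UTF-8. The input bytes are a made-up example, and the assumption that the source encoding is Latin-1 is exactly that, an assumption; with real data the source encoding usually has to be determined first.

```r
# Hypothetical input: the word "café" stored as Latin-1 bytes,
# where "é" is the single byte 0xE9.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
x <- rawToChar(latin1_bytes)

# A lone 0xE9 byte is not a valid UTF-8 sequence, so this check fails.
was_valid <- validUTF8(x)

# Re-encode from the assumed source encoding to UTF-8.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

# After conversion the string is valid UTF-8 and "é" occupies two bytes.
is_valid <- validUTF8(x_utf8)
utf8_bytes <- charToRaw(x_utf8)

print(was_valid)
print(is_valid)
print(utf8_bytes)
```

Checking `validUTF8()` before and after conversion is a cheap way to catch strings that would otherwise fail later, for example during tokenization.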


juliasilge commented on July 22, 2024

Peter, thanks so much for your thoughtful comment, and I apologize for being slow to get back to you on this. I think this set of skills is SO IMPORTANT and I see a lot of value in compiling a good list of resources for text cleaning tasks. I literally was just responding to GitHub issues and blog comments from people with encoding issues for non-English languages. 😂 And knowing what to do with JSON or how to scrape the web is also super important.

Vignettes are typically about the code in a package, though, and how to use those functions. I've never seen a vignette that consists mainly of a list of resources external to that package; it is just not the right venue. If you put together this kind of guide and publish it, I would love to link to it from my blog, tweet about it, etc, but I don't think it's right for a vignette. Please let me know if you put together a valuable resource like this and publish it!


pgensler commented on July 22, 2024

Thanks so much for your reply! That's totally understandable. I wish I had used Twitter more to get help on these issues. Yeah, I'm a bit torn as to where to house these resources, as I seem to run into these issues more and more, especially with R.

I can understand why it would not make sense to have it located in a vignette. My only gripe is that I want a new R user to get off on the right foot, and not merely drown in not knowing what to do (like with Docker, shameless plug for my blog post: https://medium.com/@peterjgensler/creating-sandbox-environments-for-r-with-docker-def54e3491a3) (to be fair, I saw your Dockerfile, and I restrained myself 😭)

Do you think it's appropriate to have a portion of the README at the bottom dedicated to simply referring people to resources that may assist them in the data munging process? I know it's not entirely in line with the package's intent, but it's there if someone needs it?

I'm currently in the process of creating a blog post using this data as an example, but I feel like I've run into so many issues that I'm not even sure how to "set someone up for success", given all the troubles I've run into just trying to get it into a workable format. I wonder if most people (like myself 😃) enter into text mining assuming that you can just slam a largish file (like the Yelp dataset challenge) into this, and it should work just fine (which is definitely not the case 😂). Maybe this is just me, but I've had a myriad of nightmares attempting to clean this data:
https://community.rstudio.com/t/split-uneven-length-vectors-to-columns-with-tidyr/2704/10
https://stackoverflow.com/questions/43218761/by-row-vs-rowwise-iteration/43236272?noredirect=1#comment75570823_43236272

To give you a bit of context, just merely tokenizing the raw data took over an hour with a different program (called Alteryx), and went from a 1GB file to 17GB (for one file, the other is here)


juliasilge commented on July 22, 2024

A link in the README is a good idea; if you put together a list of resources, let me know and I can add it.

(And I will have to check out your Docker post; I am excited to learn more about good practices for Docker.)


pgensler commented on July 22, 2024

Sounds good, I'll do a roundup, and see what I can come up with. As a new developer, Docker is so worthwhile but the pain......


pgensler commented on July 22, 2024

FYI @juliasilge I intended to write about functions, but a good portion covered encoding issues in R:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77


github-actions commented on July 22, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

