Comments (9)
Hello there, Peter! Cleaning text is definitely a challenge. In the book, we went in the direction of including information on cleaning text within the case studies, not in a separate chapter. There are notes on how we cleaned text in the chapter on NASA metadata (which included HTML tags), the Usenet chapter (where we cleaned headers and such), and the Twitter chapter (where we talked about using regex to extract words and Twitter hashtags/usernames/etc.).
There are certainly a lot of strategies to use! They are so specific to each dataset that we chose to show how various strategies can be applied in context. These are mostly of the stringr/regex variety. Thanks for sharing your input.
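For a rough sense of what this kind of regex cleaning looks like, here is a minimal sketch. It is not code from the book: it is written in Python rather than stringr, and the sample string, patterns, and function names are all illustrative assumptions. It strips HTML tags, decodes entities, and pulls hashtags and usernames out of a string.

```python
import html
import re

# Illustrative patterns (assumptions, not the book's): a naive tag matcher
# and simple word-character matchers for hashtags and usernames.
TAG_RE = re.compile(r"<[^>]+>")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def strip_tags(text):
    """Drop anything that looks like an HTML tag, decode entities, tidy spaces."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)  # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()


def extract_hashtags(text):
    """Return every #hashtag in the text, in order of appearance."""
    return HASHTAG_RE.findall(text)


def extract_mentions(text):
    """Return every @username in the text, in order of appearance."""
    return MENTION_RE.findall(text)


# Hypothetical sample input, made up for this sketch.
sample = "<p>Loving <b>#rstats</b> with @some_user &amp; friends</p>"
cleaned = strip_tags(sample)
print(cleaned)
print(extract_hashtags(cleaned), extract_mentions(cleaned))
```

A regex tag stripper like this is only good enough for simple markup; messy real-world HTML usually calls for a proper parser, and which patterns are appropriate depends heavily on the dataset.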
from tidytext.
Julia,
Thanks for your comments. Is there any way to contribute an example to the book? I'm not sure if you have published the book or plan on it. The only reason I ask is that I've encountered so many issues with this single project that it may be worthwhile to include as a case study or vignette if you are willing, as I think it would help others with text mining. Let me know your thoughts on this.
Hi Julia,
Not to belabor this issue, but would you be willing to accept PRs for additional vignettes for your package? As I'm going through the NASA chapter in your book, I think that this information is definitely quite valuable, but I'm wondering if it may be helpful to have a one-page summary of resources to turn to for basic text tasks, similar to what you have outlined in your book?
Possibly:
- Dealing with encoding in R, as this can definitely be a pain point (esp. UTF-8 strings)
- JSON parsing, clearly lots of resources
- Web scraping, as more than likely you will need to do it to acquire text from web pages
It's not meant to be a comprehensive tutorial, but I'm curious to see if you think this fits with the spirit of tidy-text. It may very well not be meant for your package, which is fine with me. Let me know, and I'd be willing to help put together a minimal vignette.
This also relates to the janeaustenr package used in tidytext:
https://cran.r-project.org/web/packages/utf8/vignettes/utf8.html
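To make the encoding pain point a bit more concrete, here is a minimal sketch in Python (not R, and not taken from any of the packages mentioned here; the function names and the fallback order are assumptions for illustration). It tries UTF-8 first, falls back to Latin-1 when the bytes are not valid UTF-8, and then normalizes the result so that visually identical strings compare equal.

```python
import unicodedata


def decode_bytes(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return the first clean decode.

    The default order is an assumption: Latin-1 accepts any byte sequence,
    so it acts as a last resort and may yield mojibake for other encodings.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller supplies encodings that all fail.
    return raw.decode("utf-8", errors="replace")


def normalize(text):
    """Compose characters (NFC) so 'e' + combining accent equals a single 'é'."""
    return unicodedata.normalize("NFC", text)


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")
print(decode_bytes(utf8_bytes), decode_bytes(latin1_bytes))
print(normalize("cafe\u0301") == "caf\u00e9")
```

Guessing encodings this way is a heuristic; when the source declares its encoding, using that declaration is more reliable.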
Peter, thanks so much for your thoughtful comment, and I apologize for being slow to get back to you on this. I think this set of skills is SO IMPORTANT and I see a lot of value in compiling a good list of resources for text cleaning tasks. I literally was just responding to GitHub issues and blog comments from people with encoding issues for non-English languages. 😂 And knowing what to do with JSON or how to scrape the web is also super important.
Vignettes are typically about the code in a package, though, and how to use those functions. I've never seen a vignette that consists mainly of a list of resources external to that package; it is just not the right venue. If you put together this kind of guide and publish it, I would love to link to it from my blog, tweet about it, etc, but I don't think it's right for a vignette. Please let me know if you put together a valuable resource like this and publish it!
Thanks so much for your reply! That's totally understandable. I wish I had used Twitter more to get help on these issues. Yeah, I'm a bit torn as to where to house these resources, as I seem to run into these issues more and more, especially with R.
I can understand why it would not make sense to have it located in a vignette. My only gripe is that I want a new R user to get off on the right foot, and not merely drown in not knowing what to do (like with Docker, shameless plug for my blog post: https://medium.com/@peterjgensler/creating-sandbox-environments-for-r-with-docker-def54e3491a3) (to be fair, I saw your Dockerfile, and I restrained myself 😭)
Do you think it's appropriate to have a portion of the README at the bottom dedicated to simply referring people to resources that may assist them in the data munging process? I know it's not entirely in line with the package's intent, but it's there if someone needs it?
I'm currently in the process of creating a blog post using this data as an example, but I feel like I've run into so many issues that I'm not even sure how to "set someone up for success", given all the troubles I've run into just trying to get it into a workable format. I wonder if most people (like myself 😃) enter into text mining assuming that you can just slam a largish file (like the Yelp dataset challenge) into this, and it should work just fine (which is definitely not the case 😂). Maybe this is just me, but I've had a myriad of nightmares attempting to clean this data:
https://community.rstudio.com/t/split-uneven-length-vectors-to-columns-with-tidyr/2704/10
https://stackoverflow.com/questions/43218761/by-row-vs-rowwise-iteration/43236272?noredirect=1#comment75570823_43236272
To give you a bit of context, merely tokenizing the raw data took over an hour with a different program (called Alteryx), and the data went from a 1 GB file to 17 GB (for one file, the other is here)
A link in the README is a good idea; if you put together a list of resources, let me know and I can add it.
(And I will have to check out your Docker post; I am excited to learn more about good practices for Docker.)
Sounds good, I'll do a roundup, and see what I can come up with. As a new developer, Docker is so worthwhile but the pain......
FYI @juliasilge I intended to write about functions, but a good portion covered encoding issues in R:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.