There was some controversy with this dataset and OkCupid data in general, …

Consider replacing okc_text dataset about modeldata HOT 8 CLOSED

tidymodels commented on July 24, 2024

Consider replacing okc_text dataset

from modeldata.

Comments (8)

rudeboybert commented on July 24, 2024 3

Hi all, I'm one of the co-authors of the original 2015 Journal of Statistics Education (JSE) article that recipes::okc and modeldata::okc_text are based on. While we obtained explicit permission from then-OkCupid CEO Christian Rudder to use this data and did not include usernames, there were still some outstanding issues, as @juliasilge points out (example).

As such my co-author @escoland, along with the Taylor & Francis editorial team, recently revised the paper and datasets as follows:

- The essay data was:
  - Randomly row shuffled, thereby uncoupling individual user profile data from individual essays
  - Stripped of URLs to the following websites: Facebook, Instagram, Twitter, Pinterest, Flickr
- Random noise was added to the age variable: uniformly selected from {-1, 0, +1}
- The location and last_online variables were removed
- The date the data were collected was recast as "a period in the 2010s"
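The revision steps listed above can be sketched in base R roughly as follows. This is only an illustration on a made-up data frame: the column values, the URL pattern, and the object names (`profiles`, `essays_shuffled`, `profiles_out`) are assumptions, and the actual procedure used for the revised paper may well differ.

```r
# Toy stand-in for the profile data; all values here are hypothetical.
set.seed(2015)
profiles <- data.frame(
  age         = c(24L, 31L, 45L, 28L),
  location    = c("city a", "city b", "city c", "city d"),
  last_online = c("t1", "t2", "t3", "t4"),
  essay0      = c(
    "find me at https://www.facebook.com/someone for more",
    "i like hiking",
    "photos at http://instagram.com/someone",
    "books, films, food"
  ),
  stringsAsFactors = FALSE
)

# 1. Shuffle the essay column on its own, so essays no longer line up
#    with the profile rows they came from.
essays_shuffled <- sample(profiles$essay0)

# 2. Strip URLs pointing at the listed social media sites
#    (the pattern is an assumed approximation, not the authors' rule).
social_url <- "(https?://)?(www\\.)?(facebook|instagram|twitter|pinterest|flickr)\\.com\\S*"
essays_shuffled <- gsub(social_url, "", essays_shuffled,
                        ignore.case = TRUE, perl = TRUE)

# 3. Add noise drawn uniformly from {-1, 0, +1} to age.
noise <- sample(c(-1L, 0L, 1L), size = nrow(profiles), replace = TRUE)
profiles_out <- profiles
profiles_out$age <- profiles$age + noise

# 4. Drop location, last_online, and the original (unshuffled) essays.
keep <- setdiff(names(profiles_out), c("location", "last_online", "essay0"))
profiles_out <- profiles_out[, keep, drop = FALSE]
```

Keeping the shuffled essays in a separate object, rather than binding them back onto the profile rows, makes it harder to mistake them for still being linked to individual users.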

While by no means a perfect privacy guarantee in the differential privacy sense, especially when combined with auxiliary datasets, we all felt this still went a long way.

I was in the process of revising all derivative products when I came across this thread (I vaguely recall @topepo using it for a tidymodels-related project somewhere). All things being equal, just like OP @EmilHvitfeldt I too would favor replacing the use of okc and okc_text with another dataset, unless as part of a lesson on data ethics (see below). However, if the disruption to existing code would be too great, at the very least I recommend using the revised version of the data: for example, the profiles_revised and essay0_revised_and_shuffled datasets in the forthcoming update to the okcupiddata package on CRAN.

On a personal note, I've moved away from using this dataset as a pure example dataset, and now only use and encourage its use for "teachable moment" meta-lessons on data ethics and privacy. For example, my Smith College SDS colleague @beanumber used it for the data ethics module of his capstone course. Two SDS students, @tiffanyxiao and @yifan6210, penned this thought-provoking and well-researched letter to the JSE editor that will be published shortly. In fact, this letter formed the basis of the above-mentioned revisions.

juliasilge commented on July 24, 2024 1

It looks like if we use filter(year > 1980) and set artist and medium to be factors, the data will compress to about 100.1 KB. When I run R CMD check, I don't get any errors about the size of the package, so I think we can add this in one release before removing the OkCupid dataset in the release after that.
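A minimal sketch of that size check, assuming dplyr is available. The toy rows, the object name `tate_small`, and the choice of `compress = "xz"` are all assumptions made for illustration; the thread does not say which compression setting the package uses.

```r
library(dplyr)

# Toy stand-in for the Tate artwork data; these rows are hypothetical.
artwork <- data.frame(
  artist = c("Artist A", "Artist A", "Artist B", "Artist C"),
  medium = c("Medium X", "Medium Y", "Medium X", "Medium Z"),
  year   = c(1975, 1985, 1992, 2003),
  stringsAsFactors = FALSE
)

# Keep the later works and store the repeated strings as factors.
tate_small <- artwork %>%
  filter(year > 1980) %>%
  mutate(across(c(artist, medium), as.factor))

# Save with compression and compute the size on disk in KB.
path <- tempfile(fileext = ".rda")
save(tate_small, file = path, compress = "xz")
size_kb <- file.size(path) / 1024
```

Factors tend to help here because each distinct artist or medium string is stored once as a level, with the column itself holding integer codes.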

juliasilge commented on July 24, 2024

Data scraped from OkCupid got a lot of press in 2016 or so when that particularly problematic release happened. A main difference between the paper for the particular source used in modeldata and the ones linked above is that usernames were not included in the version that is in modeldata. Other than that, it is effectively the same. It is of course better not to include usernames, but there are still issues to think through.

People who think about research with public social media data have done interesting work in recent years, like this paper about people's attitudes and expectations for research using public social media data. The recent #medbikini paper (and retraction) is another example of how public social media data being public isn't good enough as a reason to use a dataset.

Anyway, worth some thought.

juliasilge commented on July 24, 2024

Wow, that is so great @rudeboybert! 🙌💕💫

My opinion is that finding another text data set appropriate for examples, README, etc. is the best option for tidymodels. I can start to work on that task soon so we can begin replacing this.

juliasilge commented on July 24, 2024

I've been thinking about this some more and one "replacement" I'd like to consider for okc_text is the Tate Collection data set from Tidy Tuesday earlier this year. You can check out some modeling I did with it here.

library(tidyverse)
artwork <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv')
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_character(),
#>   id = col_double(),
#>   artistId = col_double(),
#>   year = col_double(),
#>   acquisitionYear = col_double(),
#>   width = col_double(),
#>   height = col_double(),
#>   depth = col_double(),
#>   thumbnailCopyright = col_logical()
#> )
#> ℹ Use `spec()` for the full column specifications.
artwork %>% 
  filter(year > 1970, artistRole == "artist") %>% 
  select(id, artist, title, medium, year, dimensions)
#> # A tibble: 11,356 x 6
#>       id artist    title                medium       year dimensions            
#>    <dbl> <chr>     <chr>                <chr>       <dbl> <chr>                 
#>  1  5572 Grant, D… Kinetic Realisation… Film         1974 <NA>                  
#>  2 98173 Katz, Al… West Window          Oil paint …  1979 support: 196 x 238 x …
#>  3 98174 Katz, Al… Lillies Against Yel… Oil paint …  1983 support: 307 x 229 x …
#>  4 98175 Katz, Al… Young Trees          Oil paint …  1989 support: 407 x 301 x …
#>  5 98177 Katz, Al… Daisies #2           Oil paint …  1992 support: 231 x 320 x …
#>  6 98178 Katz, Al… Ocean View           Oil paint …  1992 support: 231 x 320 x …
#>  7 98181 Katz, Al… Winter Branch        Oil paint …  1993 support: 230 x 302 x …
#>  8 98182 Katz, Al… Night Branch         Oil paint …  1994 support: 302 x 230 x …
#>  9 98183 Katz, Al… West Palm Beach      Oil paint …  1997 support: 228 x 299 x …
#> 10 98184 Katz, Al… 3 PM, November       Oil paint …  1997 support: 231 x 302 mm…
#> # … with 11,346 more rows

Created on 2021-06-26 by the reprex package (v2.0.0)

Pros

- We can filter down by year to something that is of appropriate size for inclusion in examples and still makes sense as a corpus, e.g. "art from the Tate Collection created after 1970" or whatever.
- It has multiple short text variables that can be used for various kinds of tokenization, like artist, title, medium.
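To make the tokenization point concrete, here is a small base R sketch of word-level tokenization on short text columns. The `medium` values and the helper `tokenize_words()` are hypothetical; in tidymodels examples a dedicated tokenizer would more likely be used, so treat this only as an illustration of the idea.

```r
# Titles as they appear in the printed sample; medium values are hypothetical.
tate_text <- data.frame(
  title  = c("West Window", "Young Trees", "Ocean View"),
  medium = c("Oil paint on canvas", "Oil paint on linen", "Oil paint on canvas"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase, split on runs of non-letters,
# and drop any empty strings left at the edges.
tokenize_words <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

medium_tokens <- tokenize_words(tate_text$medium)

# Term frequencies across the whole column.
medium_counts <- sort(table(unlist(medium_tokens)), decreasing = TRUE)
```

The same helper can be applied to `title` or `artist`, which is what makes a data set with several short text columns convenient for documentation examples.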

Cons

- There isn't really anything similar to okc here to use as a drop-in replacement, so we'd want to look for other data sets to use there.
- If the goal is to use year as the outcome, modeling results aren't super stellar.

Maybe that is fine for examples, though. Thoughts?

EmilHvitfeldt commented on July 24, 2024

I think this would be a great candidate as a replacement for okc_text!

Text fields are fairly small, but that is not a bad thing since they will fit better in documentation material.

Modeling results are not perfect, which I don't find to be the biggest problem. Maybe binning year into something sensible would produce something good?
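One possible reading of "binning year" is grouping works by decade, which turns the outcome into a small number of classes. The sketch below shows two base R ways to do that; the choice of decade-wide bins is an assumption, not something the thread settles on.

```r
# Years taken from the sample output printed earlier in the thread.
years <- c(1974, 1979, 1983, 1989, 1992, 1997)

# Option 1: decade labels via integer arithmetic.
decade <- paste0(floor(years / 10) * 10, "s")

# Option 2: explicit left-closed bins with cut(), which returns a factor
# and makes it easy to switch to unequal bin widths later.
era <- cut(
  years,
  breaks = c(1970, 1980, 1990, 2000),
  labels = c("1970s", "1980s", "1990s"),
  right  = FALSE
)
```

With `right = FALSE` each bin includes its lower bound, so a work from 1980 lands in the 1980s bin and not the 1970s one.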

juliasilge commented on July 24, 2024

Just so we remember our plan:

- For the 0.1.1 release, the new Tate data set is now added and the OkC datasets are described as deprecated
- For the 0.1.2 release, we will remove the OkC datasets entirely from the package

Or maybe I've got the version numbers wrong. @topepo how would you number the next two releases?

github-actions commented on July 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
