
Comments (8)

rudeboybert commented on July 24, 2024

Hi all, I'm one of the co-authors of the original 2015 Journal of Statistics Education (JSE) article that recipes::okc and modeldata::okc_text are based on. While we obtained explicit permission from then-OkCupid CEO Christian Rudder to use this data and did not include usernames, as @juliasilge points out, there were still some outstanding issues (example).

As such my co-author @escoland, along with the Taylor & Francis editorial team, recently revised the paper and datasets as follows:

  • The essay data were
    • Randomly row-shuffled, thereby uncoupling individual user profile data from individual essays
    • Stripped of URLs to the following websites: Facebook, Instagram, Twitter, Pinterest, Flickr
  • Random noise was added to the age variable, drawn uniformly from {-1, 0, +1}
  • The location and last_online variables were removed
  • The date the data were collected was recast as "a period in the 2010s"
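
To make the steps above concrete, here is a minimal base R sketch on a made-up toy data frame. It is not the code used for the actual revision: the column values, the URL pattern, the seed, and the object names `profiles_noised` and `essays_shuffled` are all assumptions for illustration only.

```r
set.seed(2015)  # arbitrary seed so the sketch is reproducible

# Toy stand-in for the profile data; every value here is made up
profiles <- data.frame(
  age         = c(25L, 31L, 42L, 29L),
  location    = c("a", "b", "c", "d"),
  last_online = c("t1", "t2", "t3", "t4"),
  essay0      = c(
    "find me at facebook.com/someone",
    "i like hiking",
    "photos: https://www.instagram.com/someone/ and more",
    "hello"
  ),
  stringsAsFactors = FALSE
)

# 1. Strip URLs pointing at the listed sites (hypothetical pattern, .com only)
sites <- "facebook|instagram|twitter|pinterest|flickr"
url_pattern <- paste0("(https?://)?(www\\.)?(", sites, ")\\.com[^[:space:]]*")
essays <- gsub(url_pattern, "", profiles$essay0, ignore.case = TRUE)

# 2. Shuffle the essays on their own, so row i no longer matches profile i
essays_shuffled <- sample(essays)

# 3. Add noise drawn uniformly from {-1, 0, +1} to age
noise <- sample(c(-1L, 0L, 1L), nrow(profiles), replace = TRUE)
profiles_noised <- profiles
profiles_noised$age <- profiles$age + noise

# 4. Drop location and last_online (the essay now lives in its own vector)
drop_cols <- c("location", "last_online", "essay0")
keep_cols <- setdiff(names(profiles_noised), drop_cols)
profiles_noised <- profiles_noised[, keep_cols, drop = FALSE]
```

Shuffling the essays separately from the profiles is what breaks the link between a profile row and its essay; the noise on age only blurs each value by at most one year.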

While this is by no means a perfect privacy guarantee in the differential privacy sense, especially when the data are combined with auxiliary datasets, we all felt it still went a long way.

I was in the process of revising all derivative products when I came across this thread (I vaguely recall @topepo using it for a tidymodels-related project somewhere). All things being equal, just like OP @EmilHvitfeldt, I too would favor replacing the use of okc and okc_data with another dataset, unless as part of a lesson on data ethics (see below). However, if the disruption to existing code would be too great, at the very least I recommend using the revised version of the data: for example, the profiles_revised and essay0_revised_and_shuffled datasets in the forthcoming CRAN update to the okcupiddata package.

On a personal note, I've moved away from using this dataset as a pure example dataset, and now only use and encourage its use for "teachable moment" meta-lessons on data ethics and privacy. For example, my Smith College SDS colleague @beanumber used it for the data ethics module of his capstone course. Two SDS students, @tiffanyxiao and @yifan6210, penned this thought-provoking and well-researched letter to the (JSE) editor that will be published shortly. In fact, this letter formed the basis of the above-mentioned revisions.

from modeldata.

juliasilge commented on July 24, 2024

It looks like if we use filter(year > 1980) and set artist and medium to be factors, it compresses to just 100.1 KB. When I run R CMD check, I don't get any errors about the size of the package, so I think we can add this in one release before removing the OkCupid dataset in the release after that.
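
For anyone wanting to try the same kind of size check, here is a rough base R sketch on made-up toy data. The data frame `artwork_toy`, the xz compression setting, and the `.rda` format are assumptions for illustration, not necessarily how the data are stored in modeldata, and the resulting size will not match the number quoted above.

```r
# Toy stand-in for the Tate artwork data; every value here is made up
artwork_toy <- data.frame(
  artist = rep(c("Artist A", "Artist B"), each = 500),
  medium = rep(c("Oil paint on canvas", "Film"), times = 500),
  year   = rep(1961:2010, times = 20),
  stringsAsFactors = FALSE
)

# Keep later works only, then store the repeated strings as factors
recent <- subset(artwork_toy, year > 1980)
recent$artist <- factor(recent$artist)
recent$medium <- factor(recent$medium)

# Save with xz compression to a temporary file and report its size in KB
path <- tempfile(fileext = ".rda")
save(recent, file = path, compress = "xz")
size_kb <- file.size(path) / 1024
size_kb
```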


juliasilge commented on July 24, 2024

Data scraped from OkCupid got a lot of press in 2016 or so when that particularly problematic release happened. A main difference between the paper for the particular source used in modeldata and the ones linked above is that usernames were not included in the version that is in modeldata. Other than that, it is effectively the same. It is of course better not to include usernames, but there are still issues to think through.

People who think about research with public social media data have done interesting work in recent years, like this paper about people's attitudes and expectations for research using public social media data. The recent #medbikini paper (and retraction) is another example of how public social media data being public isn't good enough as a reason to use a dataset.

Anyway, worth some thought.


juliasilge commented on July 24, 2024

Wow, that is so great @rudeboybert! 🙌💕💫

My opinion is that finding another text data set appropriate for examples, README, etc is the best option for tidymodels. I can start to work on that task soon and get us replacing this.


juliasilge commented on July 24, 2024

I've been thinking about this some more and one "replacement" I'd like to consider for okc_text is the Tate Collection data set from Tidy Tuesday earlier this year. You can check out some modeling I did with it here.

library(tidyverse)
artwork <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv')
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_character(),
#>   id = col_double(),
#>   artistId = col_double(),
#>   year = col_double(),
#>   acquisitionYear = col_double(),
#>   width = col_double(),
#>   height = col_double(),
#>   depth = col_double(),
#>   thumbnailCopyright = col_logical()
#> )
#> ℹ Use `spec()` for the full column specifications.
artwork %>% 
  filter(year > 1970, artistRole == "artist") %>% 
  select(id, artist, title, medium, year, dimensions)
#> # A tibble: 11,356 x 6
#>       id artist    title                medium       year dimensions            
#>    <dbl> <chr>     <chr>                <chr>       <dbl> <chr>                 
#>  1  5572 Grant, D… Kinetic Realisation… Film         1974 <NA>
#>  2 98173 Katz, Al… West Window          Oil paint …  1979 support: 196 x 238 x …
#>  3 98174 Katz, Al… Lillies Against Yel… Oil paint …  1983 support: 307 x 229 x …
#>  4 98175 Katz, Al… Young Trees          Oil paint …  1989 support: 407 x 301 x …
#>  5 98177 Katz, Al… Daisies #2           Oil paint …  1992 support: 231 x 320 x …
#>  6 98178 Katz, Al… Ocean View           Oil paint …  1992 support: 231 x 320 x …
#>  7 98181 Katz, Al… Winter Branch        Oil paint …  1993 support: 230 x 302 x …
#>  8 98182 Katz, Al… Night Branch         Oil paint …  1994 support: 302 x 230 x …
#>  9 98183 Katz, Al… West Palm Beach      Oil paint …  1997 support: 228 x 299 x …
#> 10 98184 Katz, Al… 3 PM, November       Oil paint …  1997 support: 231 x 302 mm…
#> # … with 11,346 more rows

Created on 2021-06-26 by the reprex package (v2.0.0)

Pros

  • We can filter down by year to something that is of appropriate size for inclusion in examples and still makes sense as a corpus, e.g. "art from the Tate Collection created after 1970" or whatever.
  • It has multiple short text variables that can be used for various kinds of tokenization, like artist, title, medium.

Cons

  • There isn't really anything similar to okc here to use as a drop-in replacement, so we'd want to look for other data sets to use there.
  • If the goal is to use year as the outcome, modeling results aren't super stellar:
    [image: modeling results with year as the outcome]

Maybe that is fine for examples, though. Thoughts?


EmilHvitfeldt commented on July 24, 2024

I think this would be a great candidate as a replacement for okc_text!

Text fields are fairly small, but that is not a bad thing since they will fit better in documentation material.

Modeling results are not perfect, which I don't find to be the biggest problem. Maybe binning year into something sensible would produce something good?
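
One way the binning idea could look, as a small base R sketch: the example years and the choice of decade-wide bins are assumptions for illustration, not something decided in this thread.

```r
# A few made-up creation years
years <- c(1972, 1985, 1990, 1999, 2004, 2011)

# Bin into decades; right = FALSE makes each bin [start, start + 10)
decade <- cut(
  years,
  breaks = seq(1970, 2020, by = 10),
  right  = FALSE,
  labels = paste0(seq(1970, 2010, by = 10), "s")
)

# Count how many works fall into each decade
table(decade)
```

This turns the numeric outcome into a factor with a handful of levels, so the example becomes a classification problem rather than a regression on year.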


juliasilge commented on July 24, 2024

Just so we remember our plan:

  • the new Tate data set is now added and the OkC datasets are described as deprecated, for the 0.1.1 release
  • for the 0.1.2 release, we will remove the OkC dataset entirely from the package

Or maybe I've got the version numbers wrong. @topepo how would you number the next two releases?


github-actions commented on July 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

