Comments (8)
Hi all, I'm one of the co-authors of the original 2015 Journal of Statistics Education (JSE) article that `recipes::okc` and `modeldata::okc_text` are based on. While we obtained explicit permission from then-OkCupid CEO Christian Rudder to use this data and did not include usernames, as @juliasilge points out there were still some outstanding issues (example).
As such my co-author @escoland, along with the Taylor & Francis editorial team, recently revised the paper and datasets as follows:
- The essay data was
  - Randomly row shuffled, thereby uncoupling individual user profile data from individual essays
  - Stripped of URLs to the following websites: Facebook, Instagram, Twitter, Pinterest, Flickr
- Random noise was added to the `age` variable: uniformly selected between {-1, 0, +1}
- The `location` and `last_online` variables were removed
- The date the data were collected was recast as "a period in the 2010s"
While by no means a perfect privacy guarantee in the differential privacy sense, especially when combined with auxiliary datasets, we all felt this still went a long way.
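To make the revision steps above concrete, here is a minimal base R sketch on made-up data. It is not the authors' actual procedure: the data frames, column values, and the URL pattern are all assumptions for illustration only.

```r
# Toy stand-ins for the profile and essay data; every value here is made up.
set.seed(2015)
profiles_toy <- data.frame(
  age = c(22L, 35L, 41L, 29L, 53L),
  location = c("a", "b", "c", "d", "e"),
  last_online = c("t1", "t2", "t3", "t4", "t5"),
  stringsAsFactors = FALSE
)
essays_toy <- data.frame(
  essay0 = c(
    "find me at https://www.facebook.com/someone or just say hi",
    "i like hiking",
    "photos on http://instagram.com/someone",
    "new in town",
    "coffee, books, dogs"
  ),
  stringsAsFactors = FALSE
)

# 1. Shuffle essay rows on their own so essays no longer line up with profiles.
essays_shuffled <- essays_toy[sample(nrow(essays_toy)), , drop = FALSE]
rownames(essays_shuffled) <- NULL

# 2. Remove URLs for the listed sites (this pattern is an illustrative guess).
sites <- "facebook|instagram|twitter|pinterest|flickr"
url_pattern <- paste0("(https?://)?(www\\.)?(", sites, ")\\.com\\S*")
essays_shuffled$essay0 <- gsub(
  url_pattern, "", essays_shuffled$essay0,
  ignore.case = TRUE, perl = TRUE
)

# 3. Add noise drawn uniformly from {-1, 0, +1} to age.
profiles_noised <- profiles_toy
noise <- sample(c(-1L, 0L, 1L), nrow(profiles_toy), replace = TRUE)
profiles_noised$age <- profiles_toy$age + noise

# 4. Drop the location and last_online variables.
profiles_noised$location <- NULL
profiles_noised$last_online <- NULL
```

Shuffling the essays separately from the profiles is what breaks the row-wise link between the two tables; the noise and dropped columns only make the profile rows themselves harder to match against outside data.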
I was in the process of revising all derivative products when I came across this thread (I vaguely recall @topepo using it for a tidymodels related project somewhere). All things being equal, just like OP @EmilHvitfeldt I too would favor replacing the use of `okc` and `okc_data` with another dataset, unless as part of a lesson on data ethics (see below). However, if the disruption to existing code would be too great, at the very least I recommend using the revised version of the data. For example, the `profiles_revised` and `essay0_revised_and_shuffled` datasets in the forthcoming update to the `okcupiddata` package on CRAN.
On a personal note, I've moved away from using this dataset as a pure example dataset, and now only use and encourage its use for "teachable moment" meta-lessons on data ethics and privacy. For example, my Smith College SDS colleague @beanumber used it for the data ethics module of his capstone course. Two SDS students @tiffanyxiao @yifan6210 penned this thought-provoking and well researched letter to the (JSE) editor that will be published shortly. In fact, this letter formed the basis of the above-mentioned revisions.
from modeldata.
It looks like if we use `filter(year > 1980)` and set `artist` and `medium` to be factors, it will compress to just at 100.1 KB. When I run R CMD check, I don't get any errors about the size of the package so I think we can add this in one release before removing the OKCupid dataset at the next release after that.
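As a rough sketch of how one might check that kind of size, the following uses a made-up stand-in for the artwork data and base R subsetting instead of `filter()`. The toy data frame and the choice of `xz` compression are assumptions, so the resulting size will not match the figure quoted above.

```r
# Toy stand-in for the Tate artwork data; every value here is made up.
artwork_toy <- data.frame(
  artist = rep(c("Artist A", "Artist B", "Artist C"), each = 200),
  medium = rep(c("Oil paint on canvas", "Film", "Screenprint on paper"), times = 200),
  year = rep(1961:2020, times = 10),
  stringsAsFactors = FALSE
)

# Keep only the more recent works and store repeated strings as factors.
recent <- artwork_toy[artwork_toy$year > 1980, ]
recent$artist <- factor(recent$artist)
recent$medium <- factor(recent$medium)

# Save the object to a compressed .rda file and inspect its size in KB.
path <- tempfile(fileext = ".rda")
save(recent, file = path, compress = "xz")
size_kb <- file.size(path) / 1024
size_kb
```

Factors store each distinct string once plus an integer code per row, which is why converting highly repetitive columns such as `artist` and `medium` tends to help.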
from modeldata.
Data scraped from OkCupid got a lot of press in 2016 or so when that particularly problematic release happened. A main difference between the paper for the particular source used in modeldata and the ones linked above is that usernames were not included in the version that is in modeldata. Other than that, it is effectively the same. It is of course better not to include usernames, but there are still issues to think through.
People who think about research with public social media data have done interesting work in recent years, like this paper about people's attitudes and expectations for research using public social media data. The recent #medbikini paper (and retraction) is another example of how public social media data being public isn't good enough as a reason to use a dataset.
Anyway, worth some thought.
from modeldata.
Wow, that is so great @rudeboybert!
My opinion is that finding another text data set appropriate for examples, README, etc is the best option for tidymodels. I can start to work on that task soon and get us replacing this.
from modeldata.
I've been thinking about this some more and one "replacement" I'd like to consider for `okc_text` is the Tate Collection data set from Tidy Tuesday earlier this year. You can check out some modeling I did with it here.
library(tidyverse)
artwork <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv')
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_character(),
#>   id = col_double(),
#>   artistId = col_double(),
#>   year = col_double(),
#>   acquisitionYear = col_double(),
#>   width = col_double(),
#>   height = col_double(),
#>   depth = col_double(),
#>   thumbnailCopyright = col_logical()
#> )
#> ℹ Use `spec()` for the full column specifications.
artwork %>%
  filter(year > 1970, artistRole == "artist") %>%
  select(id, artist, title, medium, year, dimensions)
#> # A tibble: 11,356 x 6
#>       id artist    title                medium       year dimensions
#>    <dbl> <chr>     <chr>                <chr>       <dbl> <chr>
#>  1  5572 Grant, D… Kinetic Realisation… Film         1974 <NA>
#>  2 98173 Katz, Al… West Window          Oil paint …  1979 support: 196 x 238 x …
#>  3 98174 Katz, Al… Lillies Against Yel… Oil paint …  1983 support: 307 x 229 x …
#>  4 98175 Katz, Al… Young Trees          Oil paint …  1989 support: 407 x 301 x …
#>  5 98177 Katz, Al… Daisies #2           Oil paint …  1992 support: 231 x 320 x …
#>  6 98178 Katz, Al… Ocean View           Oil paint …  1992 support: 231 x 320 x …
#>  7 98181 Katz, Al… Winter Branch        Oil paint …  1993 support: 230 x 302 x …
#>  8 98182 Katz, Al… Night Branch         Oil paint …  1994 support: 302 x 230 x …
#>  9 98183 Katz, Al… West Palm Beach      Oil paint …  1997 support: 228 x 299 x …
#> 10 98184 Katz, Al… 3 PM, November       Oil paint …  1997 support: 231 x 302 mm…
#> # … with 11,346 more rows
Created on 2021-06-26 by the reprex package (v2.0.0)
Pros
- We can filter down by year to something that is of appropriate size for inclusion in examples and still makes sense as a corpus, e.g. "art from the Tate Collection created after 1970" or whatever.
- It has multiple short text variables that can be used for various kinds of tokenization, like `artist`, `title`, `medium`.
Cons
- There isn't really anything similar to `okc` here to use as a drop-in replacement, so we'd want to look for other data sets to use there.
- If the goal is to use `year` as the outcome, modeling results aren't super stellar.
Maybe that is fine for examples, though. Thoughts?
from modeldata.
I think this would be a great candidate as a replacement for `okc_text`!
Text fields are fairly small, but that is not a bad thing since they will fit better in documentation material.
Modeling results are not perfect, which I don't find to be the biggest problem. Maybe binning `year` into something sensible would produce something good?
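One way the binning idea could look, as a small sketch with base R's `cut()`. The years are the ten shown in the reprex output earlier in the thread; the decade breaks and labels are an assumption, not something anyone in the thread settled on.

```r
# Years taken from the first ten rows of the reprex output above.
year <- c(1974, 1979, 1983, 1989, 1992, 1992, 1993, 1994, 1997, 1997)

# Hypothetical decade bins; right = FALSE makes each bin [start, end).
breaks <- c(1970, 1980, 1990, 2000)
year_bin <- cut(
  year,
  breaks = breaks,
  right = FALSE,
  labels = c("1970s", "1980s", "1990s")
)

# Count how many works fall in each bin.
table(year_bin)
```

Binning turns the regression problem into a classification one, which may be easier to show in short documentation examples, at the cost of throwing away within-decade information.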
from modeldata.
Just so we remember our plan:
- the new Tate data set is now added and the OkC datasets are described as deprecated, for the 0.1.1 release
- for the 0.1.2 release, we will remove the OkC dataset entirely from the package
Or maybe I've got the version numbers wrong. @topepo how would you number the next two releases?
from modeldata.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.