vincentarelbundock / rdatasets

A collection of datasets originally distributed in R packages

Home Page: https://vincentarelbundock.github.io/Rdatasets

License: Other

Languages: HTML 99.59%, R 0.35%, Shell 0.06%

rdatasets's Issues

new dataset: radon

https://search.r-project.org/CRAN/refmans/HLMdiag/html/radon.html

new dataset: mstate

https://cran.r-project.org/web/packages/mstate/index.html

colliding paths

Hello,

thanks for the effort!
When on OSX (a case-insensitive system), I get the following warning; maybe something to look into...

warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

'csv/datasets/CO2.csv'
'csv/datasets/co2.csv'
'doc/datasets/CO2.html'
'doc/datasets/co2.html'
'doc/datasets/rst/CO2.rst'
'doc/datasets/rst/co2.rst'
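
A check for this kind of collision can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the function name `find_case_collisions` is hypothetical, the colliding paths are the ones from the warning above, and `iris.csv` is added as a non-colliding example.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths whose spellings differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() gives a case-insensitive key for comparison.
        groups[path.casefold()].append(path)
    # Keep only groups with more than one distinct spelling.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]


paths = [
    "csv/datasets/CO2.csv",
    "csv/datasets/co2.csv",
    "doc/datasets/CO2.html",
    "doc/datasets/co2.html",
    "doc/datasets/rst/CO2.rst",
    "doc/datasets/rst/co2.rst",
    "csv/datasets/iris.csv",  # illustrative entry with no collision
]

collisions = find_case_collisions(paths)
for group in collisions:
    print(group)
```

Running such a check over the full file list before committing would flag names that cannot coexist on a case-insensitive filesystem.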

Missing dataset in the datasets package

The following dataset from the datasets package is missing:
datasets::state.x77

starwars dataset only has one row - seems incomplete

Thx for the cool package.

Check this link and you will see that the dataset only has one row.
https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv

The original has 87 rows: https://dplyr.tidyverse.org/reference/starwars.html

Potential esoph data inaccuracy.

Hello!

I've been working through some tutorials using the data provided by Tuyns.[1] It is described in some detail both in this Stata bulletin about the population attributable fraction and in these lecture slides.

In these descriptions, the sum of patients should be 975 (200 cases and 775 controls).

However, the esoph data:

library(datasets)
dat <- esoph
sum(dat$ncontrols)  # 975
sum(dat$ncases)     # 200

This gives a total of 1175 records. I think an error has slipped in whereby controls is actually the total number of records (sum of cases and controls). If esoph$ncases is subtracted from esoph$ncontrols then this provides a true number of controls that matches the above descriptions.

One thing to add here: I am 99% sure that the esoph data comes from the Tuyns paper for a couple of reasons, but I can't find the exact reference, as I don't have access to the Breslow book quoted in the R documentation.

The reasons are:

The number of combinations is exactly the same (88 combinations of age, alcohol status, tobacco status)
Ille-et-Vilaine is mentioned in the R documentation of esoph
Counts and features otherwise match precisely, and odds calculated on the corrected data match those provided by tutorials.

A solution to this would be to subtract ncases from ncontrols to provide a total number column, a cases column and a controls column.
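
The proposed correction can be sketched as follows, assuming the reported ncontrols column really holds cases plus controls. The function name `split_controls` and the three per-row counts are made up for illustration; only the totals 975 and 200 come from the output above.

```python
def split_controls(ncases, ncontrols_reported):
    """Treat the reported control count as a row total and recover controls."""
    rows = []
    for cases, total in zip(ncases, ncontrols_reported):
        rows.append({
            "ncases": cases,
            "ncontrols": total - cases,  # corrected number of controls
            "ntotal": total,             # what the column held before
        })
    return rows


# Hypothetical rows, not taken from the real esoph data.
rows = split_controls(ncases=[0, 1, 4], ncontrols_reported=[40, 10, 6])
for row in rows:
    print(row)

# Aggregate check using the sums reported in the issue.
reported_controls_sum = 975
cases_sum = 200
corrected_controls_sum = reported_controls_sum - cases_sum
print(corrected_controls_sum)  # 775, matching the published description
```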

Happy to open a PR to do this if helpful. I couldn't find a github repo for the core-R datasets code so thought best to open this issue here first as I see you have that dataset included.

[1] Tuyns, A. J., G. Pequignot, and O. M. Jensen. 1977. Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bulletin du Cancer 64: 45–60.

dragracer

add datasets with errors

I'll be a bit blunt here ;-).

The problem with all example datasets is that they are perfect: no missings, no outliers, no errors, no inconsistencies.

So no exercises in cleaning, which is what you do most of your time when you work with data.

Please add the SBS2000 data set from the 'validate' package to have something truly horrible in this list, and even people learning R for the first time can suffer from the terrible state that most data is in :-).

Cheers.
Mark

how to attach dates to the `EuStockMarkets` data set

Hi Vincent,

How do you attach real dates to the EuStockMarkets data set? I think that the mts column in the EuStockMarkets data set you are sharing is missing, which may help me to get to where I need to go. Would you add that back?

Thanks,
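
One possible approach, sketched here under stated assumptions: an R `ts` object stores a start time and a frequency rather than dates, so approximate dates can be rebuilt from decimal years. The helper names are hypothetical, and the start value 1991.496 and frequency 260 are assumptions about this series that should be checked against `tsp(EuStockMarkets)` in R. Because 260 observations per year only approximates business days, the recovered dates are approximate.

```python
from datetime import datetime


def ts_times(start, frequency, n):
    """Decimal-year time index of a regular series, like R's time()."""
    return [start + i / frequency for i in range(n)]


def decimal_year_to_datetime(value):
    """Convert a decimal year such as 1991.5 to a calendar datetime."""
    year = int(value)
    year_start = datetime(year, 1, 1)
    next_year_start = datetime(year + 1, 1, 1)
    # Scale the fractional part by the actual length of that year.
    return year_start + (next_year_start - year_start) * (value - year)


# Assumed start and frequency; verify against the original ts attributes.
times = ts_times(start=1991.496, frequency=260, n=3)
for t in times:
    print(round(t, 4), decimal_year_to_datetime(t).date())
```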

Package suggestion: Add wooldridge package

Hi - thanks for organizing this really cool resource!

I'd like to suggest to add datasets from the wooldridge package, which you can find on CRAN. If you'd like, I could also add a PR this or next weekend.

Best, Alex

Selection of data sets

Hi,
On https://vincentarelbundock.github.io/Rdatasets/articles/data.html you allow searching within the datasets. Would it be possible to select data sets based on conditions, e.g. all data sets with at least two numeric variables and one factor variable?

Best Sigbert
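
A filter of this kind could look like the sketch below, assuming the dataset index exposes per-dataset counts of numeric and factor columns. The column names `n_numeric` and `n_factor`, the function `select_datasets`, and the small inline index are assumptions for illustration, not a confirmed description of datasets.csv.

```python
import csv
import io

# Illustrative index; column names and counts are assumptions.
INDEX = """Package,Item,n_numeric,n_factor
datasets,iris,4,1
datasets,mtcars,11,0
datasets,ToothGrowth,2,1
datasets,Titanic,1,4
"""


def select_datasets(rows, min_numeric=2, min_factor=1):
    """Return (package, item) pairs meeting the minimum column counts."""
    return [
        (row["Package"], row["Item"])
        for row in rows
        if int(row["n_numeric"]) >= min_numeric
        and int(row["n_factor"]) >= min_factor
    ]


rows = list(csv.DictReader(io.StringIO(INDEX)))
selected = select_datasets(rows)
print(selected)
```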

nycflights13

Hi Vincent,

nycflights13 is a widely used dataset for predictive modelling (though fairly large).

Here's the source: https://www.rdocumentation.org/packages/nycflights13/versions/1.0.1

Would you consider including it?

Cheers
Sebastian

Can't find penguins_raw in datasets.csv

Hi @vincentarelbundock and Rdatasets developers

There is a CSV file called penguins_raw.

However, this file is not listed in datasets.csv.

Thank you.

Add fpp3 datasets

Rdatasets contains the fpp2 datasets (fpp2 = Forecasting Principles and Practice, 2nd edition). I would like to have the fpp3 datasets available on Rdatasets. (fpp3 is the 3rd edition). These datasets are available on CRAN (library(fpp3)). I am willing to help do this and submit a PR if you can point me to some documentation, scripts, etc for guidance. Thanks

ggplot2movies dataset

Hi There,

movies from {ggplot2movies} is a nice data set. Here's the source: https://cran.r-project.org/web/packages/ggplot2movies/index.html

Would you consider adding it?

Dynamic html table (maybe DT)

new dataset: causaldata

Package lmec isn't available anymore

https://cran.r-project.org/web/packages/lmec/

Please remove it from DESCRIPTION.

Add two datasets from `survival` package

Hello,

I spotted two datasets that are in the R survival package but not in your repository. Is there a reason for this, or are you able to add them?

rotterdam dataset https://rdrr.io/cran/survival/man/rotterdam.html
gbsg dataset https://rdrr.io/cran/survival/man/gbsg.html

Thanks for your work

new dataset: asaur

https://cran.r-project.org/web/packages/asaur/index.html

Add datasets from heplots and vcdExtra?

I have two packages with many datasets useful for teaching & research

heplots - multivariate linear models, MANOVA, discriminant analysis, ..
vcdExtra -- categorical data analysis & visualization

These might be useful to add to Rdatasets -- what would it take?

Also: take a look at vcdExtra::datasets() -- it gives a nice summary of datasets in a package.

Please add the datasets in the R packages tsibble and tsibbledata (used in fpp3)

The book fpp3 also refers to datasets in:

the CRAN package tsibble (2 datasets)
the CRAN package tsibbledata (12 datasets).

Please add these to Rdatasets. Thanks

p.s. thanks for adding the datasets from fpp3 itself

new dataset: openintro

Minor discrepancies between subfolder csvs and master sheet

Hello @vincentarelbundock -- thanks so much for providing these data.

I did a very quick scan through the data and documentation for the same. In particular, I was looking for any discrepancies between this main sheet and the names of the data sets themselves (as in name.csv) stored in the subfolders.

Here are some that appear in the data as CSV files but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps --

"aldh2" "apoeapoc" "bomregions2011" "bomregions2012"
"bomsoi2001" "cf" "cnv" "crohn"
"Damian" "fa" "fsnps"
"head.injury" "hla" "inf1"
"jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499"
"PD"

For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.
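
A comparison like this can be sketched as a set difference between the CSV files found on disk and the entries on the master sheet. The function name `undocumented` is hypothetical, and the `csv/<package>/<item>.csv` layout is assumed from the URLs used elsewhere on this page.

```python
from pathlib import PurePosixPath


def undocumented(csv_paths, indexed):
    """Return (package, item) pairs that exist as files but not in the index."""
    found = set()
    for raw in csv_paths:
        path = PurePosixPath(raw)
        # The parent folder is the package; the stem is the dataset name.
        found.add((path.parent.name, path.stem))
    return sorted(found - set(indexed))


csv_paths = [
    "csv/datasets/iris.csv",
    "csv/DAAG/bomregions2012.csv",
]
indexed = [("datasets", "iris")]  # illustrative stand-in for the master sheet

missing = undocumented(csv_paths, indexed)
print(missing)
```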

Again, thanks for all this work!

Exporting CSV with parameters row.names = FALSE and na = "" to avoid NA strings and get rid of row.names column?

What a nice collection of datasets! Thanks!

I tried loading data using duckdb and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.

For example:

$ duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
┌─────────┬─────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬─────────┬───────┐
│ column0 │ species │  island   │ bill_length_mm │ … │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│  int64  │ varchar │  varchar  │    varchar     │   │      varchar      │   varchar   │ varchar │ int64 │
├─────────┼─────────┼───────────┼────────────────┼───┼───────────────────┼─────────────┼─────────┼───────┤
│       1 │ Adelie  │ Torgersen │ 39.1           │ … │ 181               │ 3750        │ male    │  2007 │
│       2 │ Adelie  │ Torgersen │ 39.5           │ … │ 186               │ 3800        │ female  │  2007 │
│       3 │ Adelie  │ Torgersen │ 40.3           │ … │ 195               │ 3250        │ female  │  2007 │
│       4 │ Adelie  │ Torgersen │ NA             │ … │ NA                │ NA          │ NA      │  2007 │
│       5 │ Adelie  │ Torgersen │ 36.7           │ … │ 193               │ 3450        │ female  │  2007 │
├─────────┴─────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴─────────┴───────┤
│ 5 rows                                                                             9 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compare the data types for columns bill_length_mm bill_depth_mm flipper_length_mm body_mass_g with the R dataset:

> palmerpenguins::penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007

Could exporting using write.csv(x, na = "", row.names = FALSE) alleviate this issue? For example here.
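
Until the export changes, a reader-side workaround is possible. The sketch below uses only the Python standard library, treats the literal string NA as missing, and drops the leading row-names column. The function name `read_r_csv` is hypothetical, and the empty first header is an assumption based on the `column0` name DuckDB assigned above. Note that a genuine "NA" string value would also be turned into a missing value.

```python
import csv
import io

# Small sample in the shape of the exported file (values from the output above).
SAMPLE = '''"",species,island,bill_length_mm,body_mass_g
1,Adelie,Torgersen,39.1,3750
4,Adelie,Torgersen,NA,NA
'''


def read_r_csv(text, na_token="NA"):
    """Parse CSV text, mapping na_token to None and dropping row names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # An empty first header is taken to mean an R row-names column.
    drop_first = header[0] == ""
    names = header[1:] if drop_first else header
    rows = []
    for record in reader:
        values = record[1:] if drop_first else record
        rows.append({
            name: (None if value == na_token else value)
            for name, value in zip(names, values)
        })
    return rows


rows = read_r_csv(SAMPLE)
for row in rows:
    print(row)
```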

Issues with some datasets

@kojix2 reported the following:

Hi @vincentarelbundock and Rdatasets developers!

Thank you for creating a very useful repository.

I found 3 CSV files whose names differ from their descriptions in datasets.csv.
kojix2/RDatasets#4

Stat2Data has two identical data sets. InfantMortality and InfantMortality2010. But it seems only InfantMortality2010 is described in datasets.csv.
Stat2Data/Leafhoppers.csv is described as LeafHoppers in datasets.csv
survival/genfan.csv does not exist in datasets.csv
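
The two kinds of problem in this report, a file missing from the index and a name that differs only by case, can be told apart with a small check. This is a sketch with a hypothetical function name `classify`; the example names are the Stat2Data ones mentioned above.

```python
def classify(file_items, index_items):
    """Label each file name that has no exact match in the index."""
    index_by_fold = {name.casefold(): name for name in index_items}
    report = {}
    for item in file_items:
        if item in index_items:
            continue  # exact match, nothing to report
        match = index_by_fold.get(item.casefold())
        if match is not None:
            report[item] = f"case mismatch with {match}"
        else:
            report[item] = "missing from index"
    return report


file_items = ["Leafhoppers", "InfantMortality", "InfantMortality2010"]
index_items = {"LeafHoppers", "InfantMortality2010"}

report = classify(file_items, index_items)
for name, status in report.items():
    print(name, "->", status)
```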

Thanks.
