vincentarelbundock / rdatasets
A collection of datasets originally distributed in R packages
Home Page: https://vincentarelbundock.github.io/Rdatasets
License: Other
Hello,
thanks for the effort!
On OSX (which uses a case-insensitive file system by default), I get the following warning; maybe something to look into:
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'csv/datasets/CO2.csv'
'csv/datasets/co2.csv'
'doc/datasets/CO2.html'
'doc/datasets/co2.html'
'doc/datasets/rst/CO2.rst'
'doc/datasets/rst/co2.rst'
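A minimal sketch of how such collisions could be detected before they reach a case-insensitive checkout. The function name `find_case_collisions` is hypothetical and not part of the repository's tooling; the paths are the ones listed in the warning above.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() gives a case-insensitive key for comparison.
        groups[path.casefold()].append(path)
    # Keep only groups that contain more than one distinct spelling.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]


paths = [
    "csv/datasets/CO2.csv",
    "csv/datasets/co2.csv",
    "doc/datasets/CO2.html",
    "doc/datasets/co2.html",
    "doc/datasets/rst/CO2.rst",
    "doc/datasets/rst/co2.rst",
]

collisions = find_case_collisions(paths)
for group in collisions:
    print(group)
```

Each printed group is a pair of files that cannot coexist on a case-insensitive file system.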
The datasets package is missing this dataset:
datasets::state.x77
Thx for the cool package.
Check this link and you will see that the dataset only has one row.
https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv
The original has 87 rows: https://dplyr.tidyverse.org/reference/starwars.html
Hello!
I've been working through some tutorials using the data provided by Tuyns.[1] It is described in some detail in both this Stata bulletin about the population attributable fraction and these lecture slides.
In these descriptions, the sum of patients should be 975 (200 cases and 775 controls).
However, the esoph data:
library(datasets)
dat <- esoph
sum(dat$ncontrols) # = 975
sum(dat$ncases) # = 200
This gives a total of 1175 records. I think an error has slipped in whereby ncontrols is actually the total number of records (the sum of cases and controls). If esoph$ncases is subtracted from esoph$ncontrols, the result is a number of controls that matches the above descriptions.
One thing to add here: I am 99% sure that the esoph data comes from the Tuyns paper for a couple of reasons, but I can't find the exact reference as I don't have access to the Breslow book quoted in the R documentation.
The reasons are:
The number of combinations is exactly the same (88 combinations of age, alcohol status, tobacco status)
The town of Ille-et Vilaine is mentioned in the R documentation of esoph
Counts and features otherwise match precisely, and odds calculated on the corrected data match those provided by tutorials.
A solution to this would be to subtract ncases from ncontrols to provide a total number column, a cases column and a controls column.
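The proposed correction can be sketched as follows. This assumes, as the issue argues, that the stored ncontrols column actually holds cases plus controls. The three rows are made-up placeholders, not values from esoph; only the totals 975, 200 and 775 come from the issue text. The helper name `split_controls` is hypothetical.

```python
def split_controls(rows):
    """Treat the stored ncontrols as a total and derive the control count."""
    fixed = []
    for row in rows:
        total = row["ncontrols"]  # assumption: this column holds cases + controls
        cases = row["ncases"]
        fixed.append({"ntotal": total, "ncases": cases, "ncontrols": total - cases})
    return fixed


# Made-up rows for illustration only; these are NOT values from esoph.
rows = [
    {"ncases": 1, "ncontrols": 40},
    {"ncases": 0, "ncontrols": 10},
    {"ncases": 3, "ncontrols": 9},
]
fixed = split_controls(rows)
print(fixed)

# The same arithmetic applied to the totals quoted in the issue:
reported_total = 975
reported_cases = 200
print(reported_total - reported_cases)  # 775 controls, matching Tuyns
```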
Happy to open a PR to do this if helpful. I couldn't find a GitHub repo for the core R datasets code, so I thought it best to open this issue here first, as I see you have that dataset included.
[1] Tuyns, A. J., G. Pequignot, and O. M. Jensen. 1977. Le cancer de l’oesophage en Ille-et Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bulletin of Cancer 64: 45–60
I'll be a bit blunt here ;-).
The problem with all example datasets is that they are perfect: no missings, no outliers, no errors, no inconsistencies.
So no exercises in cleaning, which is what you do most of your time when you work with data.
Please add the SBS2000 data set from the 'validate' package to have something truly horrible in this list, and even people learning R for the first time can suffer from the terrible state that most data is in :-).
Cheers.
Mark
Hi Vincent,
How do you attach real dates to the EuStockMarkets data set? I think the mts column is missing from the EuStockMarkets data set you are sharing; having it might help me get to where I need to go. Would you add that back?
Thanks,
Hi - thanks for organizing this really cool resource!
I'd like to suggest adding the datasets from the wooldridge package, which you can find on CRAN. If you'd like, I could also open a PR this or next weekend.
Best, Alex
Hi,
on https://vincentarelbundock.github.io/Rdatasets/articles/data.html you let users search within the datasets. Would it be possible to select data sets based on conditions, e.g. all data sets with at least two numeric variables and one factor variable?
Best Sigbert
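One way such a selection might look if done locally against the index file. This is only a sketch: the column names `n_numeric` and `n_factor` are assumptions about what the index exposes, and the sample rows are invented placeholders rather than real entries.

```python
import csv
import io

# Invented index rows; column names are assumptions, not a documented schema.
SAMPLE_INDEX = """Package,Item,Rows,Cols,n_numeric,n_factor
pkgA,alpha,100,5,3,1
pkgA,beta,50,4,1,2
pkgB,gamma,20,3,2,0
"""


def select_datasets(index_text, min_numeric=2, min_factor=1):
    """Return (package, item) pairs meeting the minimum column-type counts."""
    reader = csv.DictReader(io.StringIO(index_text))
    return [
        (row["Package"], row["Item"])
        for row in reader
        if int(row["n_numeric"]) >= min_numeric
        and int(row["n_factor"]) >= min_factor
    ]


selected = select_datasets(SAMPLE_INDEX)
print(selected)
```

With the placeholder index, only the first entry has at least two numeric columns and at least one factor column.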
Hi Vincent,
nycflights13 is a widely used dataset for predictive modelling (fairly large, though).
Here's the source: https://www.rdocumentation.org/packages/nycflights13/versions/1.0.1
Would you consider including it?
Cheers
Sebastian
Hi @vincentarelbundock and Rdatasets developers
There is a CSV file called penguins_raw.
However, this file is not listed in datasets.csv.
Thank you.
Rdatasets contains the fpp2 datasets (fpp2 = Forecasting Principles and Practice, 2nd edition). I would like to have the fpp3 datasets available on Rdatasets. (fpp3 is the 3rd edition). These datasets are available on CRAN (library(fpp3)). I am willing to help do this and submit a PR if you can point me to some documentation, scripts, etc for guidance. Thanks
Hi There,
movies from {ggplot2movies} is a nice data set. Here's the source: https://cran.r-project.org/web/packages/ggplot2movies/index.html
Would you consider adding it?
https://cran.r-project.org/web/packages/lmec/
Please remove it from DESCRIPTION.
Hello,
I spotted these two datasets; they are in the R survival package but not in your repository. Is there a reason for this, or are you able to add them?
rotterdam dataset: https://rdrr.io/cran/survival/man/rotterdam.html
gbsg dataset: https://rdrr.io/cran/survival/man/gbsg.html
Thanks for your work
I have two packages with many datasets useful for teaching & research
heplots - multivariate linear models, MANOVA, discriminant analysis, ..
vcdExtra -- categorical data analysis & visualization
These might be useful to add to Rdatasets -- what would it take?
Also: take a look at vcdExtra::datasets() -- it gives a nice summary of datasets in a package.
The book fpp3 also refers to datasets in:
Please add these to Rdatasets. Thanks
p.s. thanks for adding the datasets from fpp3 itself
Hello @vincentarelbundock -- thanks so much for providing these data.
I did a very quick scan through the data and documentation for the same. In particular, I was looking for any discrepancies between this main sheet and the names of the data sets themselves (as in name.csv) stored in the subfolders.
Here are some that appear in the data as CSVs but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps --
"aldh2" "apoeapoc" "bomregions2011" "bomregions2012"
"bomsoi2001" "cf" "cnv" "crohn"
"Damian" "fa" "fsnps"
"head.injury" "hla" "inf1"
"jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499"
"PD"
For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.
Again, thanks for all this work!
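A rough sketch of how this kind of sweep could be automated, assuming the CSV files live under `csv/<package>/<name>.csv` (the layout seen in the URLs on this page). The function name and the "documented" entry are hypothetical; only bomregions2012 in DAAG comes from the report above.

```python
from pathlib import PurePosixPath


def undocumented_items(csv_paths, indexed_items):
    """Return (package, item) pairs that exist as CSV files but are not indexed."""
    found = set()
    for path in csv_paths:
        p = PurePosixPath(path)
        if p.suffix == ".csv":
            # The parent folder is taken as the package, the file stem as the item.
            found.add((p.parent.name, p.stem))
    return sorted(found - set(indexed_items))


csv_paths = [
    "csv/DAAG/bomregions2012.csv",  # reported above as missing from the sheet
    "csv/pkg/documented.csv",       # hypothetical entry that is indexed
]
indexed_items = [("pkg", "documented")]  # hypothetical index contents

missing = undocumented_items(csv_paths, indexed_items)
print(missing)
```

Comparing on the (package, item) pair rather than the bare file name avoids false matches when two packages ship datasets with the same name.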
What a nice collection of datasets! Thanks!
I tried loading data using duckdb and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.
For example:
$ duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
┌─────────┬─────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬─────────┬───────┐
│ column0 │ species │ island │ bill_length_mm │ … │ flipper_length_mm │ body_mass_g │ sex │ year │
│ int64 │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │ int64 │
├─────────┼─────────┼───────────┼────────────────┼───┼───────────────────┼─────────────┼─────────┼───────┤
│ 1 │ Adelie │ Torgersen │ 39.1 │ … │ 181 │ 3750 │ male │ 2007 │
│ 2 │ Adelie │ Torgersen │ 39.5 │ … │ 186 │ 3800 │ female │ 2007 │
│ 3 │ Adelie │ Torgersen │ 40.3 │ … │ 195 │ 3250 │ female │ 2007 │
│ 4 │ Adelie │ Torgersen │ NA │ … │ NA │ NA │ NA │ 2007 │
│ 5 │ Adelie │ Torgersen │ 36.7 │ … │ 193 │ 3450 │ female │ 2007 │
├─────────┴─────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴─────────┴───────┤
│ 5 rows 9 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Compare the data types for the columns bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g with the R dataset:
> palmerpenguins::penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA NA 2007
Could exporting using write.csv(x, na = "", row.names = FALSE) alleviate this issue? For example here.
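Until the export changes, a reader-side workaround might look like the sketch below. It assumes the file has an unnamed first header cell holding R row names and uses the literal token NA for missing values. The sample text reuses two rows from the output above with most columns omitted, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Two rows from the penguins output above, with most columns omitted for brevity.
SAMPLE = (
    '"","species","bill_length_mm","year"\n'
    '"1","Adelie",39.1,2007\n'
    '"4","Adelie",NA,2007\n'
)


def load_rows(text, na_token="NA"):
    """Parse CSV text, dropping an unnamed leading column and mapping NA to None."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    drop_first = header[0] == ""  # assumption: empty header name = R row names
    if drop_first:
        header = header[1:]
    rows = []
    for record in reader:
        if drop_first:
            record = record[1:]
        rows.append(
            {key: (None if value == na_token else value)
             for key, value in zip(header, record)}
        )
    return rows


rows = load_rows(SAMPLE)
print(rows)
```

Note that mapping every NA token to a missing value would also hide a genuine string "NA" in a text column, so this is a trade-off rather than a complete fix. Values are left as strings; type conversion is a separate step.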
@kojix2 reported the following:
Hi @vincentarelbundock and Rdatasets developers!
Thank you for creating a very useful repository.
I found 3 CSV files that have a different name than the datasets.csv description.
kojix2/RDatasets#4
InfantMortality and InfantMortality2010. But it seems only InfantMortality2010 is described in datasets.csv.
Stat2Data/Leafhoppers.csv is described as LeafHoppers in datasets.csv
survival/genfan.csv does not exist in datasets.csv
Thanks.
My mistake. It's been already added.
Nice idea for a site :)
There's also the modeldata package https://cran.r-project.org/package=modeldata, which is part of tidymodels
First of all, thanks for your valuable resource; it makes it easy to point students to datasets they can use to learn and experiment.
I was looking at the Snow.* datasets from HistData, and noticed that the Snow.polygons data is missing from the list.
Row 18 of the O-Rings dataset has a "Blowby" value of 2 but a "Total" value of only 1. Since "Total" equals the sum of "Blowby" and "Erosion" values, "Total" in this case should be 2 rather than 1.
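A small consistency check along these lines could flag such rows automatically. It is a sketch built on the relationship stated above (Total = Blowby + Erosion). The lowercase field names and the sample rows are illustrative assumptions; in particular, the erosion value of 0 is a placeholder, since the report does not give it.

```python
def inconsistent_rows(rows):
    """Return 1-based indices of rows where total != erosion + blowby."""
    return [
        index
        for index, row in enumerate(rows, start=1)
        if row["total"] != row["erosion"] + row["blowby"]
    ]


# Illustrative rows only; the second mimics the reported pattern
# (blowby of 2 with a total of 1), with erosion assumed to be 0.
rows = [
    {"erosion": 0, "blowby": 0, "total": 0},
    {"erosion": 0, "blowby": 2, "total": 1},
]

bad = inconsistent_rows(rows)
print(bad)
```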