vincentarelbundock / rdatasets

A collection of datasets originally distributed in R packages

Home Page: https://vincentarelbundock.github.io/Rdatasets

License: Other

Languages: HTML 99.59%, Shell 0.06%, R 0.35%

rdatasets's Issues

colliding paths

Hello,

thanks for the effort!
On OS X (with its case-insensitive filesystem), I get the following warning; maybe something to look into:

warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

'csv/datasets/CO2.csv'
'csv/datasets/co2.csv'
'doc/datasets/CO2.html'
'doc/datasets/co2.html'
'doc/datasets/rst/CO2.rst'
'doc/datasets/rst/co2.rst'
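
A minimal sketch of how such collisions could be detected before they reach a case-insensitive checkout. The helper name case_collisions is hypothetical and not part of the repository; it simply groups paths that become identical once letter case is ignored. The example paths are taken from the warning above, plus one non-colliding path added for contrast.

```r
# Hypothetical helper: group paths that differ only by letter case.
case_collisions <- function(paths) {
  key <- tolower(paths)                      # case-folded version of each path
  dup_keys <- unique(key[duplicated(key)])   # folded paths seen more than once
  groups <- split(paths, key)                # original paths, grouped by folded path
  unname(groups[dup_keys])
}

paths <- c("csv/datasets/CO2.csv",
           "csv/datasets/co2.csv",
           "csv/datasets/iris.csv")          # iris.csv is only here for contrast

collisions <- case_collisions(paths)
print(collisions)
```

In a local clone on a case-sensitive system, the input could come from list.files(recursive = TRUE) instead of a hand-written vector.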

Potential esoph data inaccuracy.

Hello!

I've been working through some tutorials using the data provided by Tuyns.[1] It is described in some detail both in this Stata bulletin about the population attributable fraction and in these lecture slides.

In these descriptions, the sum of patients should be 975 (200 cases and 775 controls).

However, in the esoph data:

library(datasets)
dat <- esoph
sum(dat$ncontrols)  # 975
sum(dat$ncases)     # 200

This gives a total of 1175 records. I think an error has slipped in whereby ncontrols actually holds the total number of records (the sum of cases and controls). Subtracting esoph$ncases from esoph$ncontrols gives a number of controls that matches the descriptions above.

One thing to add here: I am 99% sure that the esoph data comes from the Tuyns paper for a couple of reasons, but I can't find the exact reference as I don't have access to the Breslow book quoted in the R documentation.

The reasons are:

  1. The number of combinations is exactly the same (88 combinations of age, alcohol status, tobacco status)

  2. The town of Ille-et-Vilaine is mentioned in the R documentation of esoph

  3. Counts and features otherwise match precisely, and odds calculated on the corrected data match those provided by tutorials.

A solution to this would be to subtract ncases from ncontrols to provide a total number column, a cases column and a controls column.

Happy to open a PR to do this if helpful. I couldn't find a GitHub repo for the core-R datasets code, so I thought it best to open this issue here first, as I see you have that dataset included.
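
The proposed correction can be sketched roughly as follows. The function name split_controls and the new column name ntotal are hypothetical, and the three-row data frame contains made-up numbers purely for illustration (it is not an excerpt of esoph). The sketch assumes, as argued above, that ncontrols currently holds cases plus controls.

```r
# Hypothetical helper: treat the existing ncontrols column as a total,
# then derive the true number of controls by subtracting the cases.
split_controls <- function(df) {
  stopifnot(all(c("ncases", "ncontrols") %in% names(df)))
  df$ntotal <- df$ncontrols                 # keep the original value as the total
  df$ncontrols <- df$ntotal - df$ncases     # controls = total - cases
  df
}

# Made-up numbers for illustration only (not real esoph rows).
toy <- data.frame(ncases = c(0, 1, 4), ncontrols = c(40, 11, 10))

fixed <- split_controls(toy)
print(fixed)
```

This should only be applied to a copy of the data whose column sums reproduce the figures reported above; on data where ncontrols already excludes the cases it would subtract them a second time.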

[1] Tuyns, A. J., G. Pequignot, and O. M. Jensen. 1977. Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bulletin of Cancer 64: 45–60.

add datasets with errors

I'll be a bit blunt here ;-).

The problem with all example datasets is that they are perfect: no missing values, no outliers, no errors, no inconsistencies.

So no exercises in cleaning, which is what you do most of your time when you work with data.

Please add the SBS2000 data set from the 'validate' package to have something truly horrible in this list, so that even people learning R for the first time can suffer from the terrible state that most data is in :-).

Cheers.
Mark

how to attach dates to the `EuStockMarkets` data set

Hi Vincent,

How do you attach real dates to the EuStockMarkets data set? I think the mts (time series) information is missing from the EuStockMarkets data set you are sharing; having it might help me get where I need to go. Would you add that back?

Thanks,
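
One possible approach, sketched under assumptions: in R itself, EuStockMarkets is a multivariate time series whose time stamps are stored as decimal years and can be read with time(). The conversion to calendar dates below is only an approximation (it treats the fractional part as a share of a 365-day year, whereas the series is regularly spaced over business days), and the column names approx_date and decimal_year are made up for this example.

```r
library(datasets)

# Decimal-year time stamps stored in the time-series attributes.
decimal_year <- as.numeric(time(EuStockMarkets))

# Approximate conversion to calendar dates: treat the fractional part
# as a share of a 365-day year. The small epsilon guards against
# floating-point error right at year boundaries.
year <- floor(decimal_year + 1e-8)
day_offset <- round((decimal_year - year) * 365)
approx_date <- as.Date(paste0(year, "-01-01")) + day_offset

# Combine the time information with the four index columns.
stocks <- data.frame(approx_date = approx_date,
                     decimal_year = decimal_year,
                     as.data.frame(EuStockMarkets))
head(stocks)
```

The dates produced this way will not line up exactly with actual trading days, so they are best read as rough positions on the calendar rather than true observation dates.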

Package suggestion: Add wooldridge package

Hi - thanks for organizing this really cool resource!

I'd like to suggest adding the datasets from the wooldridge package, which you can find on CRAN. If you'd like, I could also open a PR this or next weekend.

Best, Alex

Add fpp3 datasets

Rdatasets contains the fpp2 datasets (fpp2 = Forecasting: Principles and Practice, 2nd edition). I would like to have the fpp3 datasets (fpp3 is the 3rd edition) available on Rdatasets. These datasets are available on CRAN (library(fpp3)). I am willing to help with this and submit a PR if you can point me to some documentation, scripts, etc. for guidance. Thanks.

Add datasets from heplots and vcdExtra?

I have two packages with many datasets useful for teaching & research

  • heplots -- multivariate linear models, MANOVA, discriminant analysis, ...

  • vcdExtra -- categorical data analysis & visualization

These might be useful to add to Rdatasets -- what would it take?

Also: take a look at vcdExtra::datasets() -- it gives a nice summary of datasets in a package.
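
For comparison, a rough base-R sketch of a similar per-package summary. This is not how vcdExtra::datasets() is implemented; it only reads the index returned by utils::data(), and the helper name list_package_datasets is hypothetical.

```r
# Hypothetical helper: list the datasets a package ships, using only
# the index that utils::data() returns (columns "Item" and "Title").
list_package_datasets <- function(pkg) {
  res <- utils::data(package = pkg)$results
  data.frame(item = res[, "Item"],
             title = res[, "Title"],
             stringsAsFactors = FALSE)
}

overview <- list_package_datasets("datasets")
head(overview)
```

Unlike a richer summary, this gives only names and titles; dimensions and classes would require loading each dataset.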

Minor discrepancies between subfolder csvs and master sheet

Hello @vincentarelbundock -- thanks so much for providing these data.

I did a very quick scan through the data and documentation for the same. In particular, I was looking for any discrepancies between this main sheet and the names of the data sets themselves (as in name.csv) stored in the subfolders.

Here are some that appear in the data as CSVs but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps --

"aldh2" "apoeapoc" "bomregions2011" "bomregions2012"
"bomsoi2001" "cf" "cnv" "crohn"
"Damian" "fa" "fsnps"
"head.injury" "hla" "inf1"
"jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499"
"PD"

For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.

Again, thanks for all this work!
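
A sweep like the one described could be sketched as below. The helper name undocumented_items is hypothetical, the demo directory is a temporary stand-in for a package subfolder, and the vector of documented names stands in for the item column of the master sheet (the exact column name in the real sheet is not assumed here).

```r
# Hypothetical helper: CSV files present in a folder whose base names
# do not appear among the documented item names.
undocumented_items <- function(csv_dir, documented) {
  files <- list.files(csv_dir, pattern = "\\.csv$")
  items <- sub("\\.csv$", "", files)        # strip the extension
  sort(setdiff(items, documented))
}

# Temporary stand-in for a package subfolder; file names are illustrative.
demo_dir <- tempfile("rdatasets-demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir,
                                c("documented_example.csv",
                                  "bomregions2012.csv"))))

documented <- c("documented_example")       # stand-in for the master sheet
missing_items <- undocumented_items(demo_dir, documented)
print(missing_items)
```

The comparison is case-sensitive, so names that differ only in capitalization would also be reported.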

Exporting CSV with parameters row.names = FALSE and na = "" to avoid NA strings and get rid of row.names column?

What a nice collection of datasets! Thanks!

I tried loading data using duckdb and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.

For example:

$ duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
┌─────────┬─────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬─────────┬───────┐
│ column0 │ species │  island   │ bill_length_mm │ … │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│  int64  │ varchar │  varchar  │    varchar     │   │      varchar      │   varchar   │ varchar │ int64 │
├─────────┼─────────┼───────────┼────────────────┼───┼───────────────────┼─────────────┼─────────┼───────┤
│       1 │ Adelie  │ Torgersen │ 39.1           │ … │ 181               │ 3750        │ male    │  2007 │
│       2 │ Adelie  │ Torgersen │ 39.5           │ … │ 186               │ 3800        │ female  │  2007 │
│       3 │ Adelie  │ Torgersen │ 40.3           │ … │ 195               │ 3250        │ female  │  2007 │
│       4 │ Adelie  │ Torgersen │ NA             │ … │ NA                │ NA          │ NA      │  2007 │
│       5 │ Adelie  │ Torgersen │ 36.7           │ … │ 193               │ 3450        │ female  │  2007 │
├─────────┴─────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴─────────┴───────┤
│ 5 rows                                                                             9 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compare the data types for the columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g with the R dataset:

> palmerpenguins::penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007

Could exporting using write.csv(x, na = "", row.names = FALSE) alleviate this issue? For example here.
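
A small sketch of what that export would produce. The data frame penguins_demo is a two-row stand-in modelled on the output above, not the real dataset, and the file is written to a temporary path.

```r
# Two-row stand-in modelled on the rows shown above (not the real dataset).
penguins_demo <- data.frame(
  species = c("Adelie", "Adelie"),
  bill_length_mm = c(39.1, NA),
  sex = c("male", NA)
)

out <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty fields instead of the string NA;
# row.names = FALSE drops the unnamed first column of row numbers.
write.csv(penguins_demo, out, na = "", row.names = FALSE)

csv_lines <- readLines(out)
print(csv_lines)

# Reading back in R: empty fields become NA again.
back <- read.csv(out, na.strings = "")
str(back)
```

With empty fields in place of NA strings, type-inferring CSV readers outside R have a better chance of detecting the numeric columns, although how each tool treats empty fields is up to that tool.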

Issues with some datasets

@kojix2 reported the following:


Hi @vincentarelbundock and Rdatasets developers!

Thank you for creating a very useful repository.

I found 3 CSV files whose names differ from their descriptions in datasets.csv.
kojix2/RDatasets#4

  • Stat2Data has two identical data sets, InfantMortality and InfantMortality2010, but it seems only InfantMortality2010 is described in datasets.csv.
  • Stat2Data/Leafhoppers.csv is described as LeafHoppers in datasets.csv
  • survival/genfan.csv does not exist in datasets.csv

Thanks.
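
Mismatches that come down to capitalization alone, like the Leafhoppers case, could be flagged with something along these lines. The helper name case_only_mismatches is hypothetical, and the two input vectors are short illustrations built from the names in this report, not the full contents of the folder or of the index.

```r
# Hypothetical helper: file names with no exact match in the index
# that do match once letter case is ignored.
case_only_mismatches <- function(file_items, index_items) {
  unmatched <- setdiff(file_items, index_items)
  hit <- match(tolower(unmatched), tolower(index_items))
  found <- !is.na(hit)
  data.frame(file = unmatched[found],
             index = index_items[hit[found]],
             stringsAsFactors = FALSE)
}

file_items <- c("Leafhoppers", "genfan")              # names of CSV files
index_items <- c("LeafHoppers", "InfantMortality2010") # names listed in the index

mismatches <- case_only_mismatches(file_items, index_items)
print(mismatches)
```

Files with no counterpart at all, such as genfan in this illustration, are not reported by this check and would need a plain set difference.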

Snow.polygons missing from HistData

First of all, thanks for your valuable resource; it makes it easy to point students to datasets they can use to learn and experiment.

I was looking at the Snow.* datasets from HistData, and noticed that the Snow.polygons data is missing from the list.

Error within O-Rings dataset

Row 18 of the O-Rings dataset has a "Blowby" value of 2 but a "Total" value of only 1. Since "Total" equals the sum of "Blowby" and "Erosion" values, "Total" in this case should be 2 rather than 1.
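
A check of this kind can be sketched as follows, assuming (as stated above) that "Total" is meant to equal "Blowby" plus "Erosion". The helper name inconsistent_totals is hypothetical, and the three rows are made-up values for illustration, not rows copied from the dataset.

```r
# Hypothetical helper: row numbers where Total differs from Blowby + Erosion.
inconsistent_totals <- function(df) {
  which(df$Total != df$Blowby + df$Erosion)
}

# Made-up rows for illustration; the last one has Blowby = 2 but Total = 1.
demo <- data.frame(Erosion = c(0, 1, 0),
                   Blowby  = c(0, 0, 2),
                   Total   = c(0, 1, 1))

bad_rows <- inconsistent_totals(demo)
print(bad_rows)
```

If "Total" is instead defined some other way in the original source, rows flagged by this check would not necessarily be errors.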
