ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics

Home Page: https://docs.ropensci.org/skimr

R 9.40% HTML 90.26% Jupyter Notebook 0.34%

unconf17 r summary-statistics ropensci unconf r-package rstats peer-reviewed

skimr's Introduction

skimr

skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.

Note: skimr version 2 has major changes when skimr is used programmatically. Upgraders should review this document, the release notes and vignettes carefully.

Installation

The current released version of skimr can be installed from CRAN. If you wish to install the current build of the next release you can do so using the following:

# install.packages("devtools")
devtools::install_github("ropensci/skimr")

The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.

To install the version with the most recent changes that have not yet been incorporated in the main branch (and may not be):

devtools::install_github("ropensci/skimr", ref = "develop")

Do not rely on APIs from the develop branch, as they are likely to change.
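For completeness, the CRAN route mentioned at the start of this section is a one-liner. This is the standard base R installation call; nothing here is specific to skimr beyond the package name.

```r
# Install the released version from CRAN, then attach it.
install.packages("skimr")
library(skimr)
```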

Skim statistics in the console

skimr:

provides a larger set of statistics than summary(), including missing, complete, n, and sd
reports each data type separately
handles dates, logicals, and a variety of other types
supports spark-bar and spark-line based on the pillar package

Separates variables by class:

skim(chickwts)

## ── Data Summary ────────────────────────
##                            Values  
## Name                       chickwts
## Number of rows             71      
## Number of columns          2       
## _______________________            
## Column type frequency:             
##   factor                   1       
##   numeric                  1       
## ________________________           
## Group variables            None    
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0  p25 p50  p75 p100 hist 
## 1 weight                0             1 261. 78.1 108 204. 258 324.  423 ▆▆▇▇▃

Presentation is in a compact horizontal format:

skim(iris)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts               
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

Built-in support for strings, lists and other column classes

skim(dplyr::starwars)

## ── Data Summary ────────────────────────
##                            Values         
## Name                       dplyr::starwars
## Number of rows             87             
## Number of columns          14             
## _______________________                   
## Column type frequency:                    
##   character                8              
##   list                     3              
##   numeric                  3              
## ________________________                  
## Group variables            None           
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 name                  0         1       3  21     0       87          0
## 2 hair_color            5         0.943   4  13     0       12          0
## 3 skin_color            0         1       3  19     0       31          0
## 4 eye_color             0         1       3  13     0       15          0
## 5 sex                   4         0.954   4  14     0        4          0
## 6 gender                4         0.954   8   9     0        2          0
## 7 homeworld            10         0.885   4  14     0       48          0
## 8 species               4         0.954   3  14     0       37          0
## 
## ── Variable type: list ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       17          0          5
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist 
## 1 height                6         0.931 174.   34.8 66 167   180 191    264 ▁▁▇▅▁
## 2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
## 3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

Has a useful summary function

skim(iris) %>%
  summary()

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None

Individual columns can be selected using tidyverse-style selectors

skim(iris, Sepal.Length, Petal.Length)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
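Selection helpers should work as well as bare column names. The following is a small sketch, assuming dplyr is installed so that starts_with() is available; the variable name sepal_only is only for illustration.

```r
library(skimr)

# Skim only the columns whose names begin with "Sepal".
sepal_only <- skim(iris, dplyr::starts_with("Sepal"))

# One row per selected column is returned.
sepal_only$skim_variable
```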

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by().

iris %>%
  dplyr::group_by(Species) %>%
  skim()

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  4         
## ________________________             
## Group variables            Species   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50  p75 p100 hist 
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5    5.2   5.8 ▃▃▇▅▁
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9  6.3   7   ▂▇▆▃▃
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5  6.9   7.9 ▁▃▇▃▂
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4  3.68  4.4 ▁▃▇▅▂
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8  3     3.4 ▁▅▆▇▂
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3    3.18  3.8 ▂▆▇▅▁
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5  1.58  1.9 ▁▃▇▃▁
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35 4.6   5.1 ▂▂▇▇▆
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55 5.88  6.9 ▃▇▇▃▂
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2  0.3   0.6 ▇▂▂▁▁
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3  1.5   1.8 ▅▇▃▆▁
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2    2.3   2.5 ▂▇▆▅▇

Behaves nicely in pipelines

iris %>%
  skim() %>%
  dplyr::filter(numeric.sd > 1)

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            None      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist 
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂
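Because the result is a data frame with one row per variable, individual statistics can be pulled out with ordinary data frame operations. A minimal sketch, assuming the skimr version 2 column layout in which statistic columns are prefixed with their type (numeric.mean, numeric.sd, and so on); the names numeric_rows and means are illustrative.

```r
library(skimr)

skimmed <- skim(iris)

# Keep only the rows that describe numeric columns.
numeric_rows <- dplyr::filter(skimmed, skim_type == "numeric")

# Build a named vector of means, keyed by the original column name.
means <- setNames(numeric_rows$numeric.mean, numeric_rows$skim_variable)
means
```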

Knitted results

Simply skimming a data frame will produce the horizontal print layout shown above. We provide a knit_print method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the skimmed object is the last item in your code chunk.

faithful %>%
  skim()

Data summary

Name	Piped data
Number of rows	272
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None


Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
eruptions	0	1	3.49	1.14	1.6	2.16	4	4.45	5.1	▇▂▂▇▇
waiting	0	1	70.90	13.59	43.0	58.00	76	82.00	96.0	▃▃▂▇▂

Customizing skimr

Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.

Specify your own statistics and classes

Users can specify their own statistics using a list combined with the skim_with() function factory. skim_with() returns a new skim function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.

Assignment within a call to skim_with() relies on a helper function, sfl or skimr function list. By default, functions in the sfl call are appended to the default skimmers, and names are automatically generated as well.

my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)

But you can also use helpers from the tidyverse to create new anonymous functions that set particular function arguments. The behavior is the same as in purrr or dplyr, with both . and .x as acceptable pronouns. Setting the append = FALSE argument uses only those functions that you’ve provided.

my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,
    p01 = ~ quantile(.x, probs = .01),
    p99 = ~ quantile(., probs = .99)
  ),
  append = FALSE
)
my_skim(iris, Sepal.Length)

And you can remove default skimmers by setting them to NULL.

my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
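One way to see what each of these options does is to compare the column names of the results. This is a sketch rather than an exhaustive check; the names appended and replaced are illustrative, and the prefixed column names assume skimr version 2.

```r
library(skimr)

# Appending: mad is added alongside the default numeric skimmers.
appended <- skim_with(numeric = sfl(mad))
appended_names <- names(appended(iris, Sepal.Length))

# Replacing: with append = FALSE only the supplied numeric skimmers remain.
replaced <- skim_with(numeric = sfl(iqr = IQR), append = FALSE)
replaced_names <- names(replaced(iris, Sepal.Length))

appended_names
replaced_names
```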

Skimming other objects

skimr has summary functions for the following types of data by default:

numeric (which includes both double and integer)
character
factor
logical
complex
Date
POSIXct
ts
AsIs
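To see which statistics are computed for each of these types without reading the source, the defaults can be listed from the console. This sketch assumes skimr version 2, where get_default_skimmer_names() returns a named list with one character vector per type.

```r
library(skimr)

# Named list: one entry per supported type, each holding the skimmer names.
defaults <- get_default_skimmer_names()

names(defaults)
defaults$numeric
```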

skimr also provides a small API for writing packages that provide their own default summary functions for data types not covered above. It relies on R S3 methods for the get_skimmers function. This function should return a sfl, similar to customization within skim_with(), but you should also provide a value for the skim_type argument. Here’s an example.

get_skimmers.my_data_type <- function(column) {
  sfl(
    skim_type = "my_data_type",
    p99 = ~ quantile(., probs = .99)
  )
}

Limitations of current version

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.

Support for spark histograms

There are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing dataframes. While some cases have been addressed, there are, for example, reports of this issue in Emacs ESS. While this is a deep issue, there is ongoing work to address it in base R.

This means that while skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances. This includes:

converting a skimr data frame to a vanilla R data frame, but tibbles render correctly
in the context of rendering to a pdf using an engine that does not support utf-8.

One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). The helper function fix_windows_histograms() does this for you.

And last but not least, we provide skim_without_charts() as a fallback. This makes it easy to still get summaries of your data, even if unicode issues continue.
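A short sketch of the fallback in use. It drops the chart column, so the remaining statistics print as plain text; the variable name plain is illustrative.

```r
library(skimr)

# Same summary as skim(), minus the inline histogram column.
plain <- skim_without_charts(iris)

names(plain)
```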

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by changing fonts to one with good building block (for histograms) and Braille support (for line graphs). For example, the open font “DejaVu Sans” from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.

Displays in documents of different types will vary. For example, one user found that the font “Yu Gothic UI Semilight” produced consistent results for Microsoft Word and LibreOffice Writer.

Inspirations

TextPlots for use of Braille characters
spark for use of block characters.

The earliest use of unicode characters to generate sparklines appears to be from 2009.

Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.

Contributing

We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr’s flexibility to add their own customized classes. Please see the contributing and conduct documents.


skimr's Issues

colformats vs colformat

Shouldn't "colformats" be "colformat" (twice) in README.md?

Deal with significant digits

Good reminder https://twitter.com/EdwardTufte/status/871049024048115713

We should think about how to handle digits and also when integers should be returned.

Buildignore needs to be updated

Make histogram optional

@haozhu233 and I did a bit of benchmarking of skim() and it looks like there are some performance issues with drawing the histogram. This is evident on large grouped data frames. We might want to allow the user to not draw the histograms if they are interested in speedier skimming.

Duplicate statistic names should not be allowed

Using skim_with() someone can make multiple statistics with the same names. Should we prevent that?

Tall or nested?

It might be worth considering if a list-column might be slightly more flexible.

tribble(
  ~ var, ~ summary, ~ value,
  "cyl", "mean",    3.5,
  "cyl", "median",  3,
  "cyl", "sd",      2.75

)
# vs
tribble(
  ~ var, ~ summary,
  "cyl", list(mean = 3.5, median = 3, sd = 2.75)
)

Supply a "tee" version

skim_tee <- function(x) {
  print(skim(x))
  invisible(x)
}

So you can verify the distribution multiple times inside a pipeline.

(This should be a separate function not an argument in order to be type stable)
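A sketch of how such a tee could sit in the middle of a pipeline. The helper name skim_and_pass is hypothetical and is defined here only for illustration; the key point is that it returns its input unchanged, so the next step still receives the original data.

```r
library(skimr)

# Hypothetical helper: print a skim, then hand the input back untouched.
skim_and_pass <- function(x) {
  print(skim(x))
  invisible(x)
}

# The skim is printed as a side effect; filtering continues on mtcars itself.
eight_cyl <- dplyr::filter(skim_and_pass(mtcars), cyl == 8)
nrow(eight_cyl)
```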

n_unique() doesn't match documentation

The documentation for n_unique() says it returns the number of unique values but currently it returns the vector of unique values.

Default vector method

Right now, skim(mtcars$mpg) fails with Error in UseMethod("skim") : no applicable method for 'skim' applied to an object of class "c('double', 'numeric')". skim_v() solves the issue but we should do something better by default. Better error message? Use skim_v()?

add group_by functionality?

as discussed. and mentioned on twitter.

Vignettes needed

We could use some good vignettes of both simple and advanced use.

Ordered factors cause an error

skimr chokes on ordered factors.

library(tidyverse)
library(skimr)

df <- data_frame(x = rnorm(100),
                 y = rnorm(100),
                 z = factor(sample(LETTERS[1:5], 100, replace = TRUE)))
skim(df)
#> Numeric Variables
#> # A tibble: 2 x 13
#>     var    type missing complete     n        mean       sd       min
#>   <chr>   <chr>   <dbl>    <dbl> <dbl>       <dbl>    <dbl>     <dbl>
#> 1     x numeric       0      100   100 -0.01725644 1.065178 -2.477188
#> 2     y numeric       0      100   100 -0.02650740 1.041577 -2.259213
#> # ... with 5 more variables: `25% quantile` <dbl>, median <dbl>, `75%
#> #   quantile` <dbl>, max <dbl>, hist <chr>
#> 
#> Factor Variables
#> # A tibble: 1 x 7
#>     var   type complete missing     n n_unique
#>   <chr>  <chr>    <dbl>   <dbl> <dbl>    <dbl>
#> 1     z factor      100       0   100        5
#> # ... with 1 more variables: stat <chr>

df1 <- df %>%
  mutate(z = factor(z, ordered=TRUE))
skim(df1)
#> Error in .summary_functions[[type]]: wrong arguments for subsetting an environment

failed in loadNamespace()

Hi,
I get this error.

> kimr(mtcars)
Error in kimr(mtcars) : could not find function "kimr"
> skim(mtcars)
Error: .onLoad failed in loadNamespace() for 'crayon', details:
  call: NULL
  error: 'hasColorConsole' is not an exported object from 'namespace:rstudioapi'
>

R version

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

thanks

Failing test

The test-skim.R test is failing because there is no row for the inline histogram. But I'm not sure how to add that in the correct listing. Tried a few ideas but no success.

skimr pkg - same error message for any dataframe

Installed colformat and skimr pkgs,
as indicated in GitHub pg. Ok!

skim(chickwts) # or any other data frame (ie: mtcars, iris...)
returns this message:
"Error in overscope_eval_next(overscope, expr) : object 'level' not found"

Using:

latest Rstudio v.1.0.143
R 3.4.0 (latest R)
Ubuntu Linux (32 bit)
dplyr 0.5.0 - installed from tidyverse dev pkg version: 1.1.1.9000, (not from CRAN).
tibble 1.3.1
...and all other latest pkgs.

Thanks for any guidance-
...skimr looks VERY useful, eager to use it in Rstudio / Linux :-)

fail to install on mac

Hi
I was unable to install skimr on mac using below command.
install_github("ropenscilabs/skimr")

Thanks

Easy suggestion to fit skimr output as 1 line per var , (without line "wrap-around")

Hi Elinw -
skimr working great! (Rstudio / Ubuntu Linux 32 bits).

An easy to implement suggestion
to save precious screen real estate
and make the skimr output
more readable in smaller screens.

In the top title line of skim,
please shorten the names of some of the title text...

Specifically:

25% quantile to simply: Q1
75% quantile to simply: Q3
missing to simply: miss (or NA)
complete to simply: compl.

Just these 4 easy text changes,
will avoid the "wrap around" of each variable line
in smaller monitor screens.
= much easier to read (every var is contained in one line...).

Values for each var
will then fit much better within a single screen line...

Thanks Elinw :-)
Really appreciate your effort!

Cannot get histograms to show up

Hello,

I cannot get the histogram to show up in my console (RStudio) when I run some of the example code on the GitHub page:

The following code:

# install.packages("devtools")
devtools::install_github("hadley/colformat")
devtools::install_github("ropenscilabs/skimr")

library(tidyverse)
library(colformat)
library(skimr)

skim(mtcars) %>% filter(stat=="hist")

The following are the results:

# A tibble: 11 x 5
     var    type  stat                                                                            level value
   <chr>   <chr> <chr>                                                                            <chr> <dbl>
 1   mpg numeric  hist <U+2582><U+2585><U+2587><U+2587><U+2587><U+2583><U+2581><U+2581><U+2582><U+2582>     0
 2   cyl numeric  hist <U+2586><U+2581><U+2581><U+2581><U+2583><U+2581><U+2581><U+2581><U+2581><U+2587>     0
 3  disp numeric  hist <U+2587><U+2587><U+2585><U+2581><U+2581><U+2587><U+2583><U+2582><U+2581><U+2583>     0
 4    hp numeric  hist <U+2586><U+2586><U+2587><U+2582><U+2587><U+2582><U+2583><U+2581><U+2581><U+2581>     0
 5  drat numeric  hist <U+2583><U+2587><U+2582><U+2582><U+2583><U+2586><U+2585><U+2581><U+2581><U+2581>     0
 6    wt numeric  hist <U+2582><U+2582><U+2582><U+2582><U+2587><U+2586><U+2581><U+2581><U+2581><U+2582>     0
 7  qsec numeric  hist <U+2582><U+2583><U+2587><U+2587><U+2587><U+2585><U+2585><U+2581><U+2581><U+2581>     0
 8    vs numeric  hist <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2586>     0
 9    am numeric  hist <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2586>     0
10  gear numeric  hist <U+2587><U+2581><U+2581><U+2581><U+2586><U+2581><U+2581><U+2581><U+2581><U+2582>     0
11  carb numeric  hist <U+2586><U+2587><U+2582><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581>     0

I get similar results for the other examples on the GitHub page.

Session Info

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0          purrr_0.2.2.2        readr_1.1.1          tidyr_0.6.3          tibble_1.3.3         ggplot2_2.2.1        tidyverse_1.1.1      skimr_1.0            colformat_0.0.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11      cellranger_1.1.0  plyr_1.8.4        forcats_0.2.0     tools_3.3.3       digest_0.6.12     lubridate_1.6.0   jsonlite_1.4      memoise_1.1.0     nlme_3.1-131     
[11] gtable_0.2.0      lattice_0.20-35   rlang_0.1.1       psych_1.7.5       DBI_0.6-1         rstudioapi_0.6    parallel_3.3.3    haven_1.0.0       xml2_1.1.1        httr_1.2.1       
[21] withr_1.0.2       stringr_1.2.0     hms_0.3           devtools_1.13.1   grid_3.3.3        R6_2.2.1          readxl_1.0.0      foreign_0.8-68    modelr_0.1.0      reshape2_1.4.2   
[31] magrittr_1.5      scales_0.4.1.9000 rvest_0.3.2       assertthat_0.2.0  mnormt_1.5-5      colorspace_1.3-2  stringi_1.1.5     lazyeval_0.2.0    munsell_0.4.3     broom_0.4.2      
[41] crayon_1.3.2.9000

Fix documentation for get_fun_names()

https://github.com/ropenscilabs/skimr/blob/master/R/functions.R#L198

The documentation is the same as get_funs().

sk_print does not work for all classes

sk_print currently only handles numeric, character and factor. As a result ordered factors, dates and complex are not printing.

Deal with lists in columns?

How should we handle summarizing lists in columns?

get_funs doesn't work with multiple classes

get_funs() doesn't work when there are multiple classes.
The function works fine if you use type[1] directly but if I use skim it throws

Error in .summary_functions[[type]] :
wrong arguments for subsetting an environment

I think it's working like a hash and maybe needs %in% ?

weird output formatting

Hi there,
trying to work with the package, I am getting this result:

I guess there is an easy and straightforward solution, but nothing works so far (changing encoding, reinstalling packages). Maybe you have seen something similar.
Thanks!

Error on nycflights13::flights dataset

To reproduce:

nycflights13::weather %>% skim()
nycflights13::flights %>% skim()

I get this error:


Error in .summary_functions[[type]] : 
  wrong arguments for subsetting an environment 

#Callstack
13. get_funs(FUNS) at skim_v.R#19
12. .f(.x[[i]], ...) 
11. purrr::map(.data, skim_v) at skim.R#16
10. skim.data.frame(.) at skim.R#10
9. skim(.) 
8. function_list[[k]](value) 
7. withVisible(function_list[[k]](value)) 
6. freduce(value, `_function_list`) 
5. `_fseq`(`_lhs`) 
4. eval(expr, envir, enclos) 
3. eval(quote(`_fseq`(`_lhs`)), env, env) 
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env)) 
1. nycflights13::weather %>% skim()

skimr not working (similar to #56)

> skim(iris)
Error: .onLoad failed in loadNamespace() for 'crayon', details:
  call: NULL
  error: 'hasColorConsole' is not an exported object from 'namespace:rstudioapi'

Latest daily build of Rstudio, latest R, other packages. OS X 10.11.6

notify users for empty values

For "" or " ", notify users that they might need to do a data cleaning before summarizing.

Deal with complex numbers

What summary statistics do people who work with complex numbers want? Is mean() meaningful?

Make decimals line up in print methods

Ideally the decimals should line up in a way similar to that in lucid.
https://cran.r-project.org/web/packages/lucid/vignettes/lucid_printing.html

Error in Breaks not unique

mtcars %>% 
  filter(cyl == 8) %>% 
  skim()

Produces the following error:

Error in cut.default(x, 10) : 'breaks' are not unique

Looks like it's caused by line 33 in Stats.R, and is caused by the vs variable, which is all zeroes.
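One possible guard is to treat a constant column as a single bin instead of asking cut() to invent break points. The helper below is a hypothetical base R sketch, not skimr's implementation; the name bin_counts and its n_bins argument are assumptions made for illustration.

```r
# Hypothetical helper: count values per bin, tolerating constant input.
bin_counts <- function(x, n_bins = 10) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(integer(0))
  }
  rng <- range(x)
  if (rng[1] == rng[2]) {
    # All values are identical, so distinct break points cannot be built.
    # Report everything as one bin.
    return(length(x))
  }
  # Explicit, strictly increasing breaks spanning the observed range.
  breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)
  as.integer(table(cut(x, breaks = breaks, include.lowest = TRUE)))
}

bin_counts(mtcars$vs[mtcars$cyl == 8])  # constant column from the report
bin_counts(mtcars$mpg)
```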

Improve error messages

Getting error....

Error in nchar(x) : invalid multibyte string, element 11

Is it possible to return the name of the column that is causing the issue?

Not all data types can be coerced to numeric

When skim encounters an unknown data type it attempts to coerce to numeric and do a set of default functions for numeric.
However, some data types, such as lists, cannot be coerced to numeric. In that case the following error is returned:

Error: (list) object cannot be coerced to type 'double'

Perhaps it would be good to use character as a fallback rather than numeric.

Option to spark-histograms = FALSE

Hello,

I love the layout of the skimr output. I have totally replaced the use of the summary function with skim(). With that said, knowing that the spark-histograms don't generate properly in Windows, is there any way to add an option to make it FALSE so that it does not print out? That would be great, and I think it would go faster. I am an R user not an R programmer, otherwise I would have submitted a pull request :).

Thank you,

Alfredo

Tests needed

We have pretty good coverage so far
https://codecov.io/gh/ropenscilabs/skimr/tree/master/R

but it would be great to get to 100% (or close to it). Most of them are easy tests for the individual functions in the stats.R file.
We could also use more tests of things like a column that is entirely NA.

Spark Histogram are rendered as symbols in html/md

Although the following code displays the histograms properly when I run the chunk in rmd, it turns into symbols in the rendered html or the md.

library(tidyverse)
library(skimr)

Sys.setlocale("LC_CTYPE", "Chinese")

skim(storms) %>% filter(stat=="hist")

# A tibble: 10 x 5
           var    type  stat      level value
         <chr>   <chr> <chr>      <chr> <dbl>
 1        year numeric  hist ¨z¨z¨z¨}¨~¨}¨~¨~¨~¨}     0
 2       month numeric  hist ¨x¨x¨x¨x¨x¨y¨|¨~¨z¨x     0
 3         day integer  hist ¨~¨}¨}¨}¨}¨}¨}¨}¨}¨}     0
 4        hour numeric  hist ¨~¨x¨~¨x¨x¨~¨x¨~¨x¨x     0
 5         lat numeric  hist ¨y¨~¨~¨}¨~¨~¨|¨y¨x¨x     0
 6        long numeric  hist ¨x¨|¨~¨~¨~¨}¨}¨z¨x¨x     0
 7        wind integer  hist ¨y¨~¨|¨z¨y¨y¨x¨x¨x¨x     0
 8    pressure integer  hist ¨x¨x¨x¨x¨x¨x¨y¨z¨~¨y     0
 9 ts_diameter numeric  hist ¨~¨~¨|¨y¨x¨x¨x¨x¨x¨x     0
10 hu_diameter numeric  hist ¨~¨x¨x¨x¨x¨x¨x¨x¨x¨x     0

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252   LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=English_United States.1252  LC_NUMERIC=C                           
[5] LC_TIME=English_United States.1252     

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    skimr_0.9000    dplyr_0.7.3     purrr_0.2.3     readr_1.1.1     tidyr_0.7.1     tibble_1.3.4   
[8] ggplot2_2.2.1   tidyverse_1.1.1

loaded via a namespace (and not attached):
 [1] colformat_0.0.0.9000 tidyselect_0.2.0     reshape2_1.4.2       haven_1.0.0          lattice_0.20-35     
 [6] colorspace_1.3-2     htmltools_0.3.6      yaml_2.1.14          rlang_0.1.2          foreign_0.8-67      
[11] glue_1.1.1           modelr_0.1.0         readxl_1.0.0         bindr_0.1            plyr_1.8.4          
[16] stringr_1.2.0        munsell_0.4.3        blogdown_0.0.42      gtable_0.2.0         cellranger_1.1.0    
[21] rvest_0.3.2          psych_1.7.5          evaluate_0.10        knitr_1.16           forcats_0.2.0       
[26] gapminder_0.2.0      parallel_3.4.0       broom_0.4.2          Rcpp_0.12.12         scales_0.4.1        
[31] backports_1.1.0      jsonlite_1.5         mnormt_1.5-5         hms_0.3              digest_0.6.12       
[36] stringi_1.1.5        bookdown_0.4         grid_3.4.0           rprojroot_1.2        tools_3.4.0         
[41] magrittr_1.5         lazyeval_0.2.0       crayon_1.3.2.9000    pkgconfig_2.0.1      rsconnect_0.8       
[46] xml2_1.1.1           lubridate_1.6.0      assertthat_0.2.0     rmarkdown_1.6        httr_1.2.1          
[51] rstudioapi_0.6       R6_2.2.2             nlme_3.1-131         compiler_3.4.0

Need a function to get the current list of functions for a type

We need a simple function that returns the current list of functions for a type, both so that people can find out without reading the code and so that functions can be dropped selectively.
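
One way such a lookup could work is sketched below in base R. The registry `summary_functions` and the helper `show_functions()` are hypothetical names used only for illustration; they are not taken from skimr's code, and the real internal structure may differ.

```r
# Hypothetical registry: one named list of summary functions per type.
# This is an illustration, not skimr's internal data structure.
summary_functions <- new.env()
summary_functions$numeric <- list(
  mean = mean,
  sd = stats::sd,
  median = stats::median
)
summary_functions$character <- list(
  n_unique = function(x) length(unique(x))
)

# Hypothetical helper: return the function names registered for one type,
# or a named list of names for every type when `type` is NULL.
show_functions <- function(type = NULL) {
  if (is.null(type)) {
    types <- ls(summary_functions)
    return(lapply(mget(types, envir = summary_functions), names))
  }
  if (!exists(type, envir = summary_functions, inherits = FALSE)) {
    stop("No summary functions registered for type: ", type)
  }
  names(get(type, envir = summary_functions))
}

show_functions("numeric")
show_functions()
```

Returning only the names keeps the output readable, and the same names could then be used to drop individual functions from the list.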

skim of `sf` objects

The sf package is the R implementation of Simple Features and is becoming a new standard for working with spatial data in R. More information at https://github.com/edzer/sfr and http://robinlovelace.net/geocompr/spatial-class.html.

The most important element of this package is the sf class. It is a simple data.frame with one additional list-column, which stores the geometry of the data.

I think it would be useful to add the ability to create a summary of sf objects. A summary of the geometry column could return some basic information, such as projection, geometry type, etc.

library(sf)
library(skimr)

nc = st_read(system.file("shape/nc.shp", package="sf"))
nc
nc %>% skim()

Error in .f(.x[[i]], ...) : 
  (list) object cannot be coerced to type 'double'
In addition: Warning message:
Skim does not know how to summarize of vector of class: sfc_MULTIPOLYGON. Coercing to numericSkim does not know how to summarize of vector of class: sfc. Coercing to numeric

Support `group_by`

Sometimes it is useful to report statistics by group, e.g., what is the recovery rate in the experimental group compared to the control group?

In dplyr/tidyverse, building groups is left to group_by. However, it appears that this feature is not (yet) supported by skimr.

I would expect this code to yield a grouped data frame, as other tidyverse code does:
mtcars %>% group_by(cyl) %>% skim

However, the code does not split up the results in groups.

It would be great to have that feature. Many thanks for the great work! 👍
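
To illustrate the kind of per-group result being requested, here is a minimal base R sketch. The helper `summarise_by_group()` is hypothetical and handles a single numeric column; it is not how skimr implements (or would implement) grouped skimming.

```r
# Hypothetical helper: summarise one numeric column separately for each
# level of a grouping column, returning one row per group.
summarise_by_group <- function(data, group, var) {
  pieces <- split(data[[var]], data[[group]])
  stats <- lapply(pieces, function(x) {
    c(
      n = length(x),
      missing = sum(is.na(x)),
      mean = mean(x, na.rm = TRUE),
      sd = stats::sd(x, na.rm = TRUE)
    )
  })
  out <- as.data.frame(do.call(rbind, stats))
  # Put the grouping values in the leading column, named after the group.
  key <- setNames(data.frame(names(pieces), stringsAsFactors = FALSE), group)
  out <- cbind(key, out)
  rownames(out) <- NULL
  out
}

summarise_by_group(mtcars, "cyl", "mpg")
```

The result has one row per value of `cyl`, which is the shape a grouped summary in the tidyverse style would be expected to have.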

latex output

In LaTeX output, "hist" is rendered as boxes; screenshots of the output:

stat=="hist" does not show

Hi, I tried the skim function with a piped filter, but the histogram is not shown:

My R version:

Thanks for the nice idea!
Simon

support for time-based variables planned?

Hello there,

Thanks for this promising package. I wonder if you plan to add support for time-based variables such as dates, timestamps, etc., the same way pandas does it, that is, showing the minimum/maximum date, frequency, etc.

That would be extremely useful!
Thanks!
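
As a rough sketch of what such a date summary might contain, the base R function below reports missing values, distinct values, and the earliest and latest date. The name `summarise_dates()` and the choice of statistics are assumptions made for illustration, not part of skimr.

```r
# Hypothetical summary for a Date vector: counts plus the date range.
summarise_dates <- function(x) {
  stopifnot(inherits(x, "Date"))
  present <- x[!is.na(x)]
  first <- min(present)
  last <- max(present)
  list(
    missing = sum(is.na(x)),
    n_unique = length(unique(present)),
    min = first,
    max = last,
    span_days = as.numeric(difftime(last, first, units = "days"))
  )
}

dates <- as.Date(c("2017-01-01", "2017-03-15", NA, "2017-03-15", "2017-12-31"))
summarise_dates(dates)
```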

Support select verbs within the function?

One approach to writing skim pipelines keeps us away from having to reimplement dplyr tools. For example:

select(mtcars, cyl) %>%
  skim()

Alternatively, we might be interested in allowing for column selection within skim().

skim(mtcars, cyl)

The latter approach gets us closer to the API listed in Amelia's original issue.
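
A minimal sketch of the second approach, in base R, might capture the unquoted column names from `...` and subset before summarising. The wrapper `summarise_cols()` is hypothetical, and base R's `summary()` stands in for the actual skim computation; full dplyr-style select helpers are out of scope here.

```r
# Hypothetical wrapper: summarise only the columns named in `...`,
# or every column when none are given.
summarise_cols <- function(data, ...) {
  # Capture the unevaluated arguments and turn each into a column name.
  cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  if (length(cols) == 0) cols <- names(data)
  unknown <- setdiff(cols, names(data))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  lapply(data[cols], summary)
}

summarise_cols(mtcars, cyl, mpg)
```

Calling `summarise_cols(mtcars)` with no column names falls back to all columns, which mirrors the behaviour of calling the function on the whole data frame.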

Alternative summary functions

Skim is designed to provide the most useful defaults to a user, given a set of data types. We've mentioned the possibility of allowing users to provide their own sets of summary functions. This would be a stretch goal for our work.
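
The idea could look something like the following base R sketch, where the user passes a named list of functions and each one is applied to a column. The helper `apply_summaries()` is a hypothetical name, and the sketch assumes every function returns a single number.

```r
# Hypothetical helper: apply a user-supplied, named list of summary
# functions to one vector and return a named numeric vector.
apply_summaries <- function(x, funs) {
  stopifnot(is.list(funs), !is.null(names(funs)))
  vapply(funs, function(f) f(x), numeric(1))
}

# A user-defined set of summaries that ignore missing values.
my_funs <- list(
  mean = function(x) mean(x, na.rm = TRUE),
  iqr = function(x) stats::IQR(x, na.rm = TRUE)
)

apply_summaries(c(1, 2, 3, 4, NA), my_funs)
```

Requiring names on the list gives each statistic a label in the output without any extra bookkeeping.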

`skim` doesn't work with custom numeric function

Here's a simple reproducible example:

library(dplyr)
library(skimr)

skim_with(numeric=list(mn=purrr::partial(mean, na.rm=TRUE)), append=FALSE)
iris %>% skim

yields:

Error in enc2utf8(col_names(col_labels, sep = sep)) :
  argument is not a character vector

Something in skim_print.R seems to be interfering with this working properly. Does format_num rely on the default function being there?

Dependency `colformat` package changed name to `pillar`

The package cannot be installed anymore as the colformat repo doesn't exist anymore and is replaced by https://github.com/hadley/pillar.

See this commit: r-lib/pillar@831aade

Error in .summary_functions[[type]] : wrong arguments for subsetting an environment

Time series data, unbalanced

str(data_complete_raw)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2448 obs. of 37 variables:
$ iso : chr "AUS" "AUS" "AUS" "AUS" ...
$ Country : chr "Australia" "Australia" "Australia" "Australia" ...
$ Year : POSIXct, format: "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01" ...
$ before.indepence : num 0 0 0 0 0 0 0 0 0 0 ...
$ currency.crises : num 0 0 0 0 0 0 0 0 0 0 ...
$ inflation.crises : num 0 0 0 0 0 0 0 0 0 0 ...
$ stock.crash : num 0 0 0 0 0 0 0 0 0 0 ...
$ sov.debt.crises.dom: num 0 0 0 0 0 0 0 0 0 0 ...

Examples needed

Most of our functions don't have examples in the documentation.

Recursive skimming

How should we skim the object produced by skim?

Organizing the output of the function

The example from precis is organized a lot like the output of str() or dplyr::glimpse(). We don't have to adhere to this format if we don't want to.

If we have lots of variables, do we want grouped output from print.skim()? Here is one suggestion:

# Skim of a My data frame
# META Stats Nvariables N obs

## Numeric Variables
#> x missing: mean: median: sd: ...

## Categorical (factors or character vectors? Separate?)
#> c missing: level_a: level_b: ...
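
A rough base R sketch of a printer along these lines is shown below. The function `print_by_type()` is hypothetical and reports only the missing count per variable; the real print method would show the full set of statistics.

```r
# Hypothetical printer: list variables under one heading per class,
# preceded by a short line of metadata about the data frame.
print_by_type <- function(data) {
  types <- vapply(data, function(col) class(col)[1], character(1))
  groups <- split(names(data), types)
  cat("# Rows: ", nrow(data), "  Columns: ", ncol(data), "\n", sep = "")
  for (type in names(groups)) {
    cat("\n## ", type, " variables\n", sep = "")
    for (v in groups[[type]]) {
      cat("#> ", v, " missing: ", sum(is.na(data[[v]])), "\n", sep = "")
    }
  }
  # Return the grouping invisibly so it can be reused programmatically.
  invisible(groups)
}

print_by_type(iris)
```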

Output of skim.grouped_df()

There's an open question of what the skim output should be for grouped dataframes. In my view we should match the dplyr::summarize() behaviour and display the grouping variables in the leading columns and preserve the grouping values in the skim_df. Currently I have the function behaving like this:

mtcars %>% 
  group_by(cyl, gear) %>% 
  skim() %>% 
  .[1:10,] %>% 
  knitr::kable()

cyl	gear	var	type	stat	level	value
6	4	mpg	numeric	missing	.all	0.000000
6	4	mpg	numeric	complete	.all	4.000000
6	4	mpg	numeric	n	.all	4.000000
6	4	mpg	numeric	mean	.all	19.750000
6	4	mpg	numeric	sd	.all	1.552418
6	4	mpg	numeric	min	.all	17.800000
6	4	mpg	numeric	median	.all	20.100000
6	4	mpg	numeric	quantile	25%	18.850000
6	4	mpg	numeric	quantile	75%	21.000000
6	4	mpg	numeric	max	.all	21.000000

Better RMarkdown functionality

Big fan of the package. However, it is awkward to use when doing analyses in RMarkdown.

Rather than appearing as a single concise output, like glimpse, the results manifest as multiple separate outputs: a console that is blank other than "Numeric Variables" and "Character Variables", and an HTML tbl_df output for every variable type in the data_frame.

Furthermore, the tables often don't show all of the variables at once, which makes using skim difficult as well.