tidyverse / tibble

A modern re-imagining of the data frame

Home Page: https://tibble.tidyverse.org/

License: Other

R 96.72% C 3.01% Mermaid 0.27%

r tidy-data

tibble's Introduction

tibble

Overview

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
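
A minimal sketch of the "lazy and surly" behaviour described above, assuming a recent tibble release. The column name `value` is made up for illustration.

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- as_tibble(df)

# data.frame: `$` silently partial-matches "val" to "value"
df_partial <- df$val

# tibble: no partial matching; the result is NULL and a warning is raised
# (the warning is suppressed here only to keep the example quiet)
tb_partial <- suppressWarnings(tb$val)
```

The data frame quietly returns the `value` column, while the tibble refuses to guess and returns `NULL`.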

If you are new to tibbles, the best place to start is the tibbles chapter in R for Data Science.

Installation

# The easiest way to get tibble is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just tibble:
install.packages("tibble")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/tibble")

Usage

library(tibble)

Create a tibble from an existing object with as_tibble():

data <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() - 1:3)
data
#>   a b          c
#> 1 1 a 2023-10-07
#> 2 2 b 2023-10-06
#> 3 3 c 2023-10-05

as_tibble(data)
#> # A tibble: 3 × 3
#>       a b     c         
#>   <int> <chr> <date>    
#> 1     1 a     2023-10-07
#> 2     2 b     2023-10-06
#> 3     3 c     2023-10-05

This will work for reasonable inputs that are already data.frames, lists, matrices, or tables.
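
As a rough illustration of the other input types, the sketch below converts a named list and a matrix. The matrix is given column names up front, on the assumption that unnamed columns would otherwise need name repair.

```r
library(tibble)

# A named list of equal-length vectors becomes one column per element
from_list <- as_tibble(list(a = 1:3, b = letters[1:3]))

# A matrix with column names becomes one column per matrix column
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("x", "y")))
from_matrix <- as_tibble(m)
```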

You can also create a new tibble from column vectors with tibble():

tibble(x = 1:5, y = 1, z = x^2 + y)
#> # A tibble: 5 × 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26

tibble() does much less than data.frame(): it never changes the type of the inputs (e.g. it keeps list columns as is), it never changes the names of variables, it only recycles inputs of length 1, and it never creates row.names(). You can read more about these features in vignette("tibble").
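
The differences can be sketched side by side. The column names below are illustrative only, and the error from `tibble()` is caught with `tryCatch()` so the example runs to completion.

```r
library(tibble)

# tibble() keeps non-syntactic names and list columns as they are
tb <- tibble(`a b` = 1:2, lst = list(1:3, letters))
tb_names <- names(tb)

# data.frame() rewrites the name to make it syntactic
df <- data.frame(`a b` = 1:2)
df_names <- names(df)

# data.frame() recycles a length-2 vector to length 4 ...
recycled <- data.frame(x = 1:4, y = 1:2)

# ... while tibble() only recycles length-1 inputs and errors otherwise
res <- tryCatch(tibble(x = 1:4, y = 1:2), error = function(e) "error")
```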

You can define a tibble row-by-row with tribble():

tribble(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
#> # A tibble: 2 × 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5

Related work

The tibble print method draws inspiration from data.table and frame. Like data.table::data.table(), tibble() doesn’t change column names and doesn’t use rownames.

Code of Conduct

Please note that the tibble project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

tibble's Issues

[[ method for tbl_df doesn't work with i, j

Issue migrated from dplyr tidyverse/dplyr#1525. Updated to show the different error message produced by tibble.

BTW you're going to get a lot of issues about nibble if my auto-correct has its way.

This is handy for inspecting list-columns.

library(tibble)
x <- data_frame(a = 1:3, b = lapply(a, seq_len))
x[[3, 2]]
#> Error in .subset2(x, i): subscript out of bounds
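
One possible workaround, sketched with tibble() in place of the older data_frame(): extract the list-column first, then index into it. This sidesteps the two-argument form of `[[` entirely and makes no claim about how the method itself should be fixed.

```r
library(tibble)

x <- tibble(a = 1:3, b = lapply(a, seq_len))

# Pull out the list-column, then take its third element
third <- x$b[[3]]

# Equivalent, using `[[` with a single column name
third_alt <- x[["b"]][[3]]
```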

Supply pull requests for other packages

Implement as_data_frame.default()

for tibble-unaware objects such as memisc::data.set.

as_data_frame.default <-
  function(x, ...) as_data_frame(as.data.frame(x, stringsAsFactors = FALSE, ...))

Add a cbind method for tbl_df?

I recently needed to do the equivalent of cbind(foo = 1:3, bar) where bar was a tbl_df, where I wanted foo to end up as the first column/variable in the resulting tbl_df.

The dplyr solutions suggested to me involved mutate() + select(..., everything()) or bind_cols(), which seems to void a reason for pulling the tibbles out of that package into this one.

Having this would be a nice usability boon for the pkg.
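
For comparison, a sketch of how this can be done without a cbind method, assuming a tibble version that provides add_column() with its `.before` argument. The contents of `bar` are made up for illustration.

```r
library(tibble)

bar <- tibble(x = letters[1:3])

# Insert `foo` as the first column of the result
res <- add_column(bar, foo = 1:3, .before = 1)
```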

Formatting of S3 classes

Currently doesn't work (at least) if class wraps atomic type.

> tibble::data_frame(hms = hms(1:3))
Source: local data frame [3 x 1]

    hms
  <dbl>
1     1
2     2
3     3

Track data_frame updates in dplyr

FYI Hadley did some refactoring of data_frame() functions in dplyr yesterday.

See commits starting with tidyverse/dplyr@3d254d6

FYI I have been wanting to split these functions out as well, so 👍 from me for this package!

knit_print.trunc_mat is not declared as S3 method in NAMESPACE

Carried over from dplyr.

FR: Make a data frame from a (possibly named) vector or list

Here's something I do fairly often, mostly with a list, but sometimes with a vector: Initialize a data frame with that list or vector as a variable and, at the same time, promote its names to a proper variable. Or, perhaps, add a variable of row numbers. Why is it so important to add the names or row numbers? Because later you'll want to process with tidyr, i.e. with unnest() and/or spread().

I could point to some real uses if I need to really sell this. But hopefully this will just make sense. Or someone will tell me it's already easy to do? It is already easy, but perhaps worth making a function for.

library(tibble)

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

## wish it were easy to make the names a proper variable
data_frame(id = names(x), thing = x)
#> Source: local data frame [3 x 2]
#> 
#>      id    thing
#>   (chr)   (list)
#> 1 alpha <chr[1]>
#> 2  beta <chr[1]>
#> 3 gamma <chr[1]>

## where id can easily default to row number
data_frame(id = seq_along(x), thing = x)
#> Source: local data frame [3 x 2]
#> 
#>      id    thing
#>   (int)   (list)
#> 1     1 <chr[1]>
#> 2     2 <chr[1]>
#> 3     3 <chr[1]>
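
A sketch of what such a function looks like in practice, assuming a tibble version in which enframe() accepts lists. The column names `id` and `thing` are taken from the example above.

```r
library(tibble)

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

# Promote the names to a proper variable and keep the list as a list-column
res <- enframe(x, name = "id", value = "thing")

# For an unnamed input, the name column falls back to the position
res_unnamed <- enframe(unname(x), name = "id", value = "thing")
```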

Why is src() in this package?

nicer printing of list columns

Seems like we will have more exotic objects in tbl_dfs in the near future. This poses a printing challenge. And whatever RStudio is doing in View() seems like a good idea. Here are two views of a tbl_df that has a bunch of tweets in it, stored as S4 status objects from the twitteR package. Could the regular print method behave more like View() and show less, to reduce the risk of obscuring other variables? Somewhat related to a question I posed on R-help and SO earlier this year.

add_rownames enhancement

Fixes tidyverse/dplyr#1564.

@zhilongjia: This sounds reasonable, but I think there should be two separate functions. Would you like to contribute to this package?

Can create corrupt data frame with matrix indexing

bar <- data_frame(a = c("a", "b"))
foo <- bar[matrix(TRUE, nrow = 2, ncol = 1)]
foo

This is the root cause of tidyverse/dplyr#1798

Distinguish between "factor" and "ordered"

column type: <fctr> vs. <ord>.
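
A quick way to check whether the two classes are told apart, assuming type_sum() is available from tibble. The exact abbreviations may differ between versions, so the sketch only compares them.

```r
library(tibble)

f <- factor(c("low", "high"))
o <- ordered(c("low", "high"), levels = c("low", "high"))

# Abbreviated column types as shown in the tibble header
f_type <- type_sum(f)
o_type <- type_sum(o)

# TRUE when factors and ordered factors get distinct abbreviations
distinct <- !identical(f_type, o_type)
```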

Should enframe work more like bind_rows?

i.e.

x <- c(a = 1, b = 4, c = 10)
enframe(x)
enframe(x, .id = "name")

But I'm not sure how it would figure out the name of the first column. Maybe use the same principle as data_frame()?

enframe <- function(x, .name = deparse(substitute(x)), .id = NULL) {
  ...
}

(in the fullness of time that would use lazyeval::expr_text() instead of deparse(substitute(x)))
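
For reference, a sketch of the interface as it exists in later tibble releases, where the column names are passed explicitly through `name` and `value` rather than derived from the expression. The names `key` and `val` are illustrative.

```r
library(tibble)

x <- c(a = 1, b = 4, c = 10)

# Two columns with caller-chosen names
two_col <- enframe(x, name = "key", value = "val")

# name = NULL drops the names and returns a single column
one_col <- enframe(x, name = NULL)
```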

base::getElement fails on tibbles

Fixes tidyverse/dplyr#1523.

Export matrixToDataFrame()

S3 dispatch on matrix class is not reliable, e.g., for factor matrices.

For tidyverse/tidyr#131.

Rename options

dplyr.print_min
dplyr.print_max
dplyr.width

See #3.
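
After the rename, the options carry a `tibble.` prefix. A sketch of setting them for a session follows; the values are arbitrary and the previous settings are restored at the end.

```r
# Set the renamed options; options() returns the previous values
old <- options(
  tibble.print_min = 5,   # rows shown when the table is truncated
  tibble.print_max = 20,  # row threshold above which truncation kicks in
  tibble.width = 80       # output width in characters
)

current_min <- getOption("tibble.print_min")

# Restore whatever was set before
options(old)
```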

Provide rbind method

That uses dplyr::bind_rows().

Moved from tidyverse/dplyr#1385

Remove add_rownames()

The original version should remain in dplyr only; functions with the new naming convention don't touch the class of the object.

Idea: Limit height of trunc_mat() output

Contains table and extra information. The height of the table can be controlled precisely, but not the height of the extra information. (See #51 for updated output format.)

Specifically, the print_max option could be used as the limit here: the new interpretation would be that at most 20 lines are printed, no matter what.

CC @lionel-.

Don't print ... if number of rows is NA on input but certain after calling head()

For SQL sources.

is.data_frame()

or is_data_frame()? Useful for testing.
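
In later releases this predicate is exported as is_tibble(). A short sketch of how it behaves in tests:

```r
library(tibble)

tb <- tibble(x = 1:3)
df <- data.frame(x = 1:3)

tb_check <- is_tibble(tb)   # a tibble passes
df_check <- is_tibble(df)   # a plain data frame does not

# Both are still data frames as far as base R is concerned
both_df <- is.data.frame(tb) && is.data.frame(df)
```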

should [[i, ]] be an error?

Shouldn't [[i, ]] be an error?

library(tibble)
iris[[ , 1]]
#> Error in `[[.data.frame`(iris, , 1): argument "..1" is missing, with no default
iris[[1, ]]
#> Error in `[[.data.frame`(iris, 1, ): argument "..2" is missing, with no default
as_data_frame(iris)[[ ,1]]
#> Error in `[[.tbl_df`(as_data_frame(iris), , 1): argument "i" is missing, with no default

Why does this "work"?

as_data_frame(iris)[[1, ]]
#> [1] 5.1

The plot thickens

library(tibble)
mtcars[["Lotus Europa", "mpg"]]
#> [1] 30.4

Why the message about a column?

as_data_frame(mtcars)[["Lotus Europa", "mpg"]]
#> Error: Unknown column 'Lotus Europa'

x[i, ] gives wrong results

> tibble::as_data_frame(iris)[1:5, ]
Source: local data frame [5 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

Can create 1d array variables

data_frame(x1 = array(1), x2 = rnorm(1))
#> Source: local data frame [1 x 2]
#> 
#>      x1        x2
#>   <dbl>     <dbl>
#> 1     1 0.8534246

I think this should be an error to be consistent with

data_frame(x1 = matrix(1, 1, 1), x2 = rnorm(1))

type_sum() for data frames

All methods should return a string with four or fewer characters, suitable for succinctly displaying column types.

Not quite what the current implementation does.

Migrate news from dplyr 0.4.3+

tidyverse/dplyr#1595 (comment)

Repair names and whitespace

Stripping spaces doesn't feel very consistent to me. Why strip only spaces and not other invisible white space characters? Why not strip other characters that are hard to type?

I think repair_names() would be better off if it focussed only on missing, blank, and duplicated column names.
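
A sketch of repair limited to duplicated names, assuming a tibble version with the `.name_repair` argument (which is not the repair_names() function discussed here). The messages about renamed columns are suppressed to keep the output quiet.

```r
library(tibble)

# Duplicated names are rejected by default ...
dup <- tryCatch(tibble(x = 1, x = 2), error = function(e) "error")

# ... and made unique on request; nothing else about the names is changed
repaired <- suppressMessages(tibble(x = 1, x = 2, .name_repair = "unique"))
repaired_names <- names(repaired)
```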

Rethink inheritance from data.frame

A tibble is not quite a data frame: Some operators have slightly different semantics. There's as.data.frame() for coercion, which currently behaves oddly by stripping derived classes. (On the other hand, this is what as.data.frame.data.frame() does, too.)

Functions that check is.data.frame() would now fail. Generics dispatching over data.frame won't dispatch tibbles anymore. This is easy to fix both by the caller (by calling as.data.frame()), and also by the implementer (by coercing or defining a tbl_df generic, or a default generic that calls as.data.frame()).

No change for functions that use duck-typing and don't check/coerce input arguments.
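
For context, a sketch of the inheritance as it currently stands, which is what the proposal would change:

```r
library(tibble)

tb <- tibble(x = 1:3)

# A tibble carries data.frame as its last class
tb_class <- class(tb)

# so checks and S3 dispatch written for data frames still apply
passes_check <- is.data.frame(tb)

# as.data.frame() strips the tibble classes
plain_class <- class(as.data.frame(tb))
```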

Tibble vs tbl_df

If you're new to tibble/dplyr, it's a bit confusing to understand the difference between tibbles and tbl_df. To help reduce this confusion we might:

Add ?tibble and explain the history & definition
Make obj_sum return "tibble" for tibbles (instead of "tbl_df")

Don't print number of rows if all rows printed

e.g.

> data_frame(x = 1:4)
Source: local data frame [4 x 1]

      x
  (int)
1     1
2     2
3     3
4     4

would be better as

> data_frame(x = 1:4)
      x
  (int)
1     1
2     2
3     3
4     4

This normally doesn't matter, but it's useful for books where space is at a premium.

as_data_frame.tbl_df() should strip additional classes

Closes tidyverse/dplyr#1744.

CC @aphalo.

expand_grid function as trimmed down version of expand.grid (similar to data_frame)

data_frame is a trimmed down version of data.frame. An analogous expand_grid to replace expand.grid would be great...

Rethink exporting dim_desc()

Currently, it's exported because print.tbl_xxx() from dplyr and backends need to access it to print dimensions themselves. Perhaps printing dimensions, and the other information (source type, grouping, ...) should be the responsibility of tibble, too.

Inconsistency: src() but make_tbl()

There's tbl() with different semantics, hence make_tbl(). Think about renaming src() to make_src().

Use fast version of matrixToDataFrame()

Not used anymore in dplyr, seems to use only basic Rcpp functionality.

tidyverse/tidyr#136 tidyverse/dplyr#1595

Change glimpse.tbl() to glimpse.tbl_df()

The implementation doesn't look like it works with non-data-frame sources.

Full test coverage

Add and use "width" argument to wrap()

For consistent output.

Provide as_data_frame method for tables

So you can easily take table(x) and turn it into a "nice" data frame.
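
A sketch of the requested behaviour using as_tibble(), assuming a tibble version with a method for tables. The input vector is made up, and the count column is assumed to be called `n`.

```r
library(tibble)

x <- c("a", "b", "a")

# One row per level, with the counts in a separate column
counts <- as_tibble(table(x))
```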

Disallow row names in tibble?

Completely disallowing row names for tibbles is not an option anymore: this would break existing code. One thing we could do is to forbid setting row names on tibbles, or at least give a warning.
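
Independently of that decision, row names can be moved into an ordinary column before or during conversion. A sketch, assuming the helpers available in later tibble releases; the column name `model` is illustrative.

```r
library(tibble)

# Move the row names of a plain data frame into a column, then convert
tb <- as_tibble(rownames_to_column(mtcars, var = "model"))

# Or do both in one step
tb2 <- as_tibble(mtcars, rownames = "model")

# The original has row names; the converted tibble does not
before <- has_rownames(mtcars)
after <- has_rownames(tb)
```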

option to clean/normalize data.frames?

Fixes tidyverse/dplyr#1587.

@r2evans: Would you like to contribute to this package?

Use tibble.width as default width in glimpse()

A default argument of NULL means the width is taken from options.

tibble() and zero-row data frames

> tibble(~a, ~b)
Error in dots[[i]] : subscript out of bounds

This should probably return an empty data frame.

CC @kevinushey
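
A sketch of the proposed result, assuming the report concerns the row-wise constructor (formula-style column headers) and using tribble(), its name in later releases:

```r
library(tibble)

# Column headers only, no data rows
empty <- tribble(~a, ~b)

n_rows <- nrow(empty)
col_names <- names(empty)
```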

Change dplyr to tibble

Documentation
Option names

Columns labels?

I'm not sure whether this should be in tibble or if I should implement it in a separate package, with a class inheriting from tbl_df, but I'd be interested in the possibility to associate (longer) labels to column names/variables. That's useful for example for survey data where variables usually have a short name and a longer label (for example the wording of the question). These labels can be stored in an attribute of the data_table, but it could be useful to have methods both for attaching these labels and for retrieving/using them. What do you think?

Regression: Awkward output for zero-row tibbles

> data_frame(a=character())
Source: local data frame [0 x 1]

Variables
  not
  shown:
  a
  (chr).

NA printing

Fiddling with example related to tidyverse/readr#295, I realized that tbls don't indicate NAs very well. If this is intentional and some sort of 'least of all evils', just close this.

library(tibble)
(x <- frame_data(
  ~country, ~code,
  "Belize", "BZ",
  "Namibia", "NA",
  "Narnia", NA_character_
))
#> Source: local data frame [3 x 2]
#> 
#>   country  code
#>     <chr> <chr>
#> 1  Belize    BZ
#> 2 Namibia    NA
#> 3  Narnia    NA
as.data.frame(x)
#>   country code
#> 1  Belize   BZ
#> 2 Namibia   NA
#> 3  Narnia <NA>
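
Until the printed output distinguishes the two, the string "NA" and a missing value can be told apart programmatically. A sketch using tribble(), the later name for frame_data():

```r
library(tibble)

x <- tribble(
  ~country, ~code,
  "Belize", "BZ",
  "Namibia", "NA",
  "Narnia", NA_character_
)

# Only the genuinely missing value is flagged
missing <- is.na(x$code)

# The literal string "NA" compares equal to "NA"; the missing value does not
literal_na <- !is.na(x$code) & x$code == "NA"
```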

Highlight significant digits

Moved from tidyverse/dplyr#897

I think the default display of tibbles could be improved if each column highlighted 3 (say) significant digits, by printing all other numbers in paler grey (in terminals that support colour). This makes tables of numbers easier to scan.

frame_data should create list-vars for non-scalar inputs

Fixes tidyverse/dplyr#1572.
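
A sketch of the behaviour the title asks for, using tribble(), the later name for frame_data(). The values are illustrative.

```r
library(tibble)

res <- tribble(
  ~x,  ~y,
  "a", 1:3,
  "b", 4:6
)

# Scalar inputs give an ordinary column, non-scalar inputs a list-column
x_is_chr <- is.character(res$x)
y_is_list <- is.list(res$y)
second <- res$y[[2]]
```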