The goal of tidycode is to allow users to analyze R expressions in a tidy way.


You can install tidycode from CRAN with:

install.packages("tidycode")

You can install the development version of tidycode from GitHub with:

# install.packages("remotes")


Read in existing code

Using the matahari package, we can read in existing code, either as a string or a file, and turn it into a matahari tibble using matahari::dance_recital().

code <- "
library(broom)
library(glue)
m <- lm(mpg ~ am, data = mtcars)
t <- tidy(m)
glue_data(t, 'The point estimate for term {term} is {estimate}.')
"

m <- matahari::dance_recital(code)

Alternatively, you may already have a matahari tibble that was recorded during an R session.

Load the tidycode library.

library(tidycode)

We can use the expressions from this matahari tibble to extract the names of the packages the code loads. We can also create a data frame that lists all the functions in those packages.

(pkg_names <- ls_packages(m$expr))
#> [1] "broom" "glue"
pkg_functions <- get_package_functions(m$expr)

Create a data frame of your expressions, splitting each into individual functions.

u <- unnest_calls(m, expr)

Merge in the package names

u <- u %>%
  dplyr::left_join(pkg_functions) %>%
  dplyr::select(func, args, line, package)
u
#> Joining, by = "func"
#> # A tibble: 8 x 4
#>   func      args              line package
#>   <chr>     <list>           <int> <chr>  
#> 1 library   <list [1]>           1 base   
#> 2 library   <list [1]>           2 base   
#> 3 <-        <list [2]>           3 base   
#> 4 lm        <named list [2]>     3 stats  
#> 5 ~         <list [2]>           3 base   
#> 6 <-        <list [2]>           4 base   
#> 7 tidy      <list [1]>           4 broom  
#> 8 glue_data <list [2]>           5 glue

Add in the function classifications!

u %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  )
#> Joining, by = "func"
#> # A tibble: 8 x 5
#>   func      args              line package classification
#>   <chr>     <list>           <int> <chr>   <chr>         
#> 1 library   <list [1]>           1 base    setup         
#> 2 library   <list [1]>           2 base    setup         
#> 3 <-        <list [2]>           3 base    data cleaning 
#> 4 lm        <named list [2]>     3 stats   modeling      
#> 5 ~         <list [2]>           3 base    modeling      
#> 6 <-        <list [2]>           4 base    data cleaning 
#> 7 tidy      <list [1]>           4 broom   modeling      
#> 8 glue_data <list [2]>           5 glue    communication

We can also remove a list of “stopwords”. The get_stopfuncs() function lists common ones: frequently used operators such as %>% and +.

u %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  dplyr::anti_join(get_stopfuncs()) %>%
  dplyr::select(func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 5 x 2
#>   func      classification
#>   <chr>     <chr>         
#> 1 library   setup         
#> 2 library   setup         
#> 3 lm        modeling      
#> 4 tidy      modeling      
#> 5 glue_data communication

tidycode's Issues

Functional programming: symbols vs function calls


I'm using tidycode to analyze students' code, and I wondered about something when looking at what follows:

> "purrr::map_dbl(mtcars, mean)" %>%
  dance_recital() %>%
  unnest_calls(expr)
# A tibble: 1 x 7
  value      error  output    warnings  messages  func    args      
  <list>     <list> <list>    <list>    <list>    <chr>   <list>    
1 <dbl [11]> <NULL> <chr [1]> <chr [0]> <chr [0]> map_dbl <list [2]>

I guess that the behavior above (i.e., the call to mean is not detected) is closely related to the fact that getParseData(parse(text = "map_dbl(mtcars, mean)")) detects mean as a SYMBOL.
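The token distinction can be checked directly. A minimal sketch, assuming only the utils package that ships with R (nothing here is tidycode API):

```r
# Parse the call, keeping source references so getParseData() has data to return
parsed <- parse(text = "map_dbl(mtcars, mean)", keep.source = TRUE)
pd <- utils::getParseData(parsed)

# Keep only terminal tokens: the pieces that appear literally in the source
tokens <- pd[pd$terminal, c("token", "text")]
print(tokens)

# map_dbl is tagged as a function call, while mean is tagged as a plain symbol
call_token <- tokens$token[tokens$text == "map_dbl"]
mean_token <- tokens$token[tokens$text == "mean"]
```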

The annoying thing is that, by using functionals, students can "hide" function calls. For instance, if I tell them to create a my_factorial function that does not call R's factorial but rather computes the factorial recursively, they can "cheat" and simply do my_factorial <- function(x) purrr::map_dbl(x, factorial).

> body(my_factorial) %>% deparse() %>% dance_recital() %>% unnest_calls(expr)
# A tibble: 3 x 7
  value  error      output warnings messages func    args      
  <list> <list>     <list> <list>   <list>   <chr>   <list>    
1 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   ::      <list [2]>
2 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   purrr   <list [2]>
3 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   map_dbl <list [2]>

Right now, I prevent this by brute-forcing the code analysis (i.e., I use stringr::str_detect), but I find this solution somewhat unpleasant...

> body(my_factorial) %>% deparse() %>% stringr::str_detect("factorial")
[1] TRUE

Any idea?

@jtleek : I guess that students could also hide p-hacking from you this way :)

PS: Up to yesterday, I didn't know about tidycode and matahari, those tools are pretty cool!
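One possible alternative to string matching, sketched here with base R's all.names() rather than any tidycode feature, is to list every name that appears in the function body, whether it is called or merely passed as an argument. The helper uses_name() and the function rec_factorial are hypothetical names introduced for illustration.

```r
# The "cheating" definition from the post above; purrr need not be installed,
# because the body is only inspected, never evaluated
my_factorial <- function(x) purrr::map_dbl(x, factorial)

# A recursive definition of the kind the exercise asks for (hypothetical)
rec_factorial <- function(x) if (x <= 1) 1 else x * rec_factorial(x - 1)

# Hypothetical helper: TRUE if `name` appears anywhere in the body of `f`,
# either as a called function or as a bare symbol
uses_name <- function(f, name) name %in% all.names(body(f))

uses_name(my_factorial, "factorial")   # factorial appears as a bare symbol
uses_name(rec_factorial, "factorial")  # only rec_factorial appears
```

Because the comparison is on whole names, rec_factorial is not mistaken for factorial, which a substring match with str_detect would do. The check is still only a heuristic: it would miss names supplied as strings, for example through do.call("factorial", ...).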

Failure when :: and `=`

I think the parser has some issues with readr/data, but I'm unsure whether the cause is that data, which is also the name of a function, is used here as an object name.

code = 'library(readr)

# data from
url = paste0("", 
info = readLines(paste0(url, ".names"))
features = c("radius", "texture", "perimeter", "area", "smoothness", 
             "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension")
measures = c("mean", "se", "worst")
hdr = c(outer(features, measures, paste, sep = "_"))
hdr = c("id", "dx", hdr)
data = readr::read_csv(paste0(url, ".data"), col_names = hdr,
                       na = c("", "NA", "?"))
'
res = matahari::dance_recital(code)
out = get_package_functions(res$expr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> Error: Some of the packages in your call list have not been installed.
#> Please install the following package before proceeding:
#>  * data = readr

Created on 2021-03-01 by the reprex package (v1.0.0)
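To help narrow this down, it may be worth comparing how R parses the two assignment forms. The sketch below uses base R only and does not touch tidycode, so it cannot confirm where the failure arises; the file name "f.csv" is a placeholder and is never read.

```r
# Parse the same statement with `=` and with `<-`; nothing is evaluated
eq    <- parse(text = 'data = readr::read_csv("f.csv")')[[1]]
arrow <- parse(text = 'data <- readr::read_csv("f.csv")')[[1]]

# The head of each call is the assignment operator that was used
eq_head    <- as.character(eq[[1]])
arrow_head <- as.character(arrow[[1]])

# The right-hand side is the same readr::read_csv() call in both cases
same_rhs <- identical(eq[[3]], arrow[[3]])
```

If the two forms differ only in the head of the call, any code that looks for package names by inspecting the whole deparsed expression around `::` would see "data = readr" in one case and "data <- readr" in the other, which is consistent with the error message above but remains an assumption.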

Meaning of the "score" column

Hi Dr. McGowan,

I'm trying to understand the score column of the classification_tbl.csv file, but I couldn't find any documentation of what the variable means or what role it plays. I'd really appreciate it if you could explain this variable or point me to where I can find information on this column. Thank you.

Data Cleaning vs Exploratory

Hi Dr. McGowan,

I'm using your tidycode package for my independent study. I ran it on one of my R scripts, written in tidyverse syntax, and compared the result to my own (eyeballed) classification. There is one discrepancy: I would classify some functions as "Exploratory" rather than "Data Cleaning", which is what tidycode returned. I recreated those lines using the built-in mtcars dataset and obtained the same result (functions such as summarize() and mean() are classified as Data Cleaning rather than Exploratory):


mtcars %>% summarize(mean(hp, na.rm = TRUE))
mtcars %>% group_by(cyl) %>% summarize(mean(wt, na.rm = TRUE))

Does the package classify all dplyr functions as Data Cleaning? Is there any way we can remedy this? Thank you.
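One possible workaround, sketched below in base R, is to apply project-specific overrides after classification. This is not a tidycode feature: the classified data frame is a hand-written stand-in with the same func and classification columns, its rows are hypothetical, and the overrides vector is an assumption about what a user might prefer.

```r
# Hypothetical stand-in for a classified result (func + classification columns)
classified <- data.frame(
  func = c("library", "group_by", "summarize", "mean"),
  classification = c("setup", "data cleaning", "data cleaning", "data cleaning"),
  stringsAsFactors = FALSE
)

# Hypothetical user-chosen overrides: function name -> preferred classification
overrides <- c(summarize = "exploratory", mean = "exploratory")

# Replace the classification only for functions that have an override
hit <- classified$func %in% names(overrides)
classified$classification[hit] <- unname(overrides[classified$func[hit]])

print(classified)
```

Functions without an override, such as group_by in this sketch, keep whatever classification they already had.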

Feature request: proportion of an R file classified to different categories?

Thanks for an amazing package. It is proving invaluable for a (very) nascent project in which my colleagues and I are trying to understand how beginning data scientists learn to visualize data.

One question: is there existing functionality - or would it be desirable to add functionality - for calculating the proportion of a total R file classified to different categories?

As I now write out an example, I wonder if this is trivial and something folks can just do; but also wonder if it would be helpful?


d <- read_rfiles(

u <- unnest_calls(d, expr)

p <- u %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  dplyr::anti_join(get_stopfuncs()) %>%
  dplyr::select(file, func, classification)
#> Joining, by = "func"
#> Joining, by = "func"

f <- function(d) {
  d %>% 
    count(file, classification) %>% 
    group_by(file) %>% 
    mutate(prop = n / sum(n))
}

f(p)

#> # A tibble: 7 x 4
#> # Groups:   file [2]
#>   file                                           classification     n  prop
#>   <chr>                                          <chr>          <int> <dbl>
#> 1 /Library/Frameworks/R.framework/Versions/3.6/… data cleaning      2 0.286
#> 2 /Library/Frameworks/R.framework/Versions/3.6/… exploratory        1 0.143
#> 3 /Library/Frameworks/R.framework/Versions/3.6/… setup              3 0.429
#> 4 /Library/Frameworks/R.framework/Versions/3.6/… visualization      1 0.143
#> 5 /Library/Frameworks/R.framework/Versions/3.6/… data cleaning      4 0.5  
#> 6 /Library/Frameworks/R.framework/Versions/3.6/… setup              1 0.125
#> 7 /Library/Frameworks/R.framework/Versions/3.6/… visualization      3 0.375

Created on 2019-11-22 by the reprex package (v0.3.0)

