lucymcgowan / tidycode

License: Other

R 100.00%

tidycode's Issues

Failure when :: and `=`

I think the parser has trouble with the combination of readr::read_csv() and an object named data, but I'm unsure whether the cause is that data is also the name of a function.

library(tidycode)
code = 'library(readr)

# data from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
url = paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/", 
             "breast-cancer-wisconsin/wdbc")
info = readLines(paste0(url, ".names"))
features = c("radius", "texture", "perimeter", "area", "smoothness", 
             "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension")
measures = c("mean", "se", "worst")
hdr = c(outer(features, measures, paste, sep = "_"))
hdr = c("id", "dx", hdr)
data = readr::read_csv(paste0(url, ".data"), col_names = hdr,
                       na = c("", "NA", "?"))
'
res = matahari::dance_recital(code)
out = get_package_functions(res$expr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> Error: Some of the packages in your call list have not been installed.
#> Please install the following package before proceeding:
#>  * data = readr

Created on 2021-03-01 by the reprex package (v1.0.0)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2021-03-01                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source                               
#>  assertthat    0.2.1   2019-03-21 [2] CRAN (R 4.0.0)                       
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.2)                       
#>  cli           2.3.0   2021-01-31 [1] CRAN (R 4.0.2)                       
#>  clipr         0.7.1   2020-10-08 [1] CRAN (R 4.0.2)                       
#>  codetools     0.2-18  2020-11-04 [1] CRAN (R 4.0.2)                       
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)                       
#>  curl          4.3     2019-12-02 [2] CRAN (R 4.0.0)                       
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)                       
#>  ellipsis      0.3.1   2020-05-15 [2] CRAN (R 4.0.0)                       
#>  evaluate      0.14    2019-05-28 [2] CRAN (R 4.0.0)                       
#>  fs            1.5.0   2020-07-31 [2] CRAN (R 4.0.2)                       
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)                       
#>  highr         0.8     2019-03-20 [2] CRAN (R 4.0.0)                       
#>  hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.2)                       
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)                       
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.2)                       
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.2)                       
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)                       
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)                       
#>  matahari      0.1.3   2020-02-06 [1] CRAN (R 4.0.2)                       
#>  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.2)                       
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.0.0)                       
#>  pryr          0.1.4   2018-02-18 [1] CRAN (R 4.0.2)                       
#>  purrr         0.3.4   2020-04-17 [2] CRAN (R 4.0.0)                       
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)                       
#>  Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.0.2)                       
#>  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)                       
#>  reprex        1.0.0   2021-01-27 [1] CRAN (R 4.0.2)                       
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)                       
#>  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.2)                       
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)                       
#>  sessioninfo   1.1.1   2018-11-05 [2] CRAN (R 4.0.0)                       
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)                       
#>  stringr       1.4.0   2019-02-10 [2] CRAN (R 4.0.0)                       
#>  styler        1.3.2   2020-02-23 [2] CRAN (R 4.0.0)                       
#>  tibble        3.0.6   2021-01-29 [1] CRAN (R 4.0.2)                       
#>  tidycode    * 0.1.1   2021-03-01 [1] Github (LucyMcGowan/tidycode@f65c3f9)
#>  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.2)                       
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.2)                       
#>  xfun          0.21    2021-02-10 [1] CRAN (R 4.0.2)                       
#>  yaml          2.2.1   2020-02-01 [2] CRAN (R 4.0.0)                       
#> 
#> [1] /Users/johnmuschelli/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
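
The error message above reports the package as "data = readr", which suggests the package name was taken from text that still contained the assignment. As a minimal sketch only (this is not tidycode's implementation, and the helper name find_namespaced_pkgs is hypothetical), package names can be read from the parsed expression instead, so the result does not depend on whether `=` or `<-` is used:

```r
# Recursively collect package names from pkg::fun and pkg:::fun calls.
# find_namespaced_pkgs() is a hypothetical helper, not part of tidycode.
find_namespaced_pkgs <- function(expr) {
  if (is.call(expr)) {
    head <- expr[[1]]
    if (is.symbol(head) && as.character(head) %in% c("::", ":::")) {
      # For pkg::fun, the second element of the call is the package symbol.
      return(as.character(expr[[2]]))
    }
    return(unique(unlist(lapply(as.list(expr), find_namespaced_pkgs))))
  }
  character(0)
}

# parse() only builds the syntax tree, so readr does not need to be installed.
exprs <- parse(text = '
data = readr::read_csv("wdbc.data")
x <- stats::median(c(1, 2, 3))
', keep.source = FALSE)

pkgs <- unique(unlist(lapply(exprs, find_namespaced_pkgs)))
pkgs
#> [1] "readr" "stats"
```

The sketch ignores corner cases such as empty arguments (for example `x[1, ]`) and default values in function definitions.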

Feature request: proportion of an R file classified to different categories?

Thanks for an amazing package. It is proving invaluable for a (very) nascent project in which my colleagues and I are trying to understand how beginning data scientists learn to visualize data.

One question: is there existing functionality - or would it be desirable to add functionality - for calculating the proportion of a total R file classified to different categories?

As I now write out an example, I wonder if this is trivial and something folks can just do; but also wonder if it would be helpful?

library(tidyverse)
library(tidycode)

d <- read_rfiles(
  tidycode_example("example_plot.R"),
  tidycode_example("example_analysis.R")
)

u <- unnest_calls(d, expr)

p <- u %>%
  dplyr::inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  ) %>%
  dplyr::anti_join(get_stopfuncs()) %>%
  dplyr::select(file, func, classification)
#> Joining, by = "func"
#> Joining, by = "func"

f <- function(d) {
  d %>% 
    count(file, classification) %>% 
    group_by(file) %>% 
    mutate(prop = n / sum(n))
}

f(p)
#> # A tibble: 7 x 4
#> # Groups:   file [2]
#>   file                                           classification     n  prop
#>   <chr>                                          <chr>          <int> <dbl>
#> 1 /Library/Frameworks/R.framework/Versions/3.6/… data cleaning      2 0.286
#> 2 /Library/Frameworks/R.framework/Versions/3.6/… exploratory        1 0.143
#> 3 /Library/Frameworks/R.framework/Versions/3.6/… setup              3 0.429
#> 4 /Library/Frameworks/R.framework/Versions/3.6/… visualization      1 0.143
#> 5 /Library/Frameworks/R.framework/Versions/3.6/… data cleaning      4 0.5  
#> 6 /Library/Frameworks/R.framework/Versions/3.6/… setup              1 0.125
#> 7 /Library/Frameworks/R.framework/Versions/3.6/… visualization      3 0.375

Created on 2019-11-22 by the reprex package (v0.3.0)
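
For comparison, the same per-file proportions can be sketched in base R with table() and prop.table(). The file names a.R and b.R are hypothetical stand-ins for the truncated paths above; the counts are copied from the output above.

```r
# Toy data: one row per classified call. File names are hypothetical.
calls <- data.frame(
  file = rep(c("a.R", "b.R"), times = c(7, 8)),
  classification = c(
    rep(c("data cleaning", "exploratory", "setup", "visualization"),
        times = c(2, 1, 3, 1)),
    rep(c("data cleaning", "setup", "visualization"),
        times = c(4, 1, 3))
  ),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row by its total (margin = 1).
counts <- table(calls$file, calls$classification)
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Unlike the count() version, this wide layout shows a zero for categories that never occur in a file.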

Data Cleaning vs Exploratory

I'm using your tidycode package for my independent study. I ran it on one of my R scripts written in tidyverse syntax and compared the result to my own (eyeballed) classification. There is one discrepancy: I would classify the functions as "Exploratory", whereas tidycode gave "Data Cleaning". I recreated those lines with the built-in mtcars dataset in place of my data and obtained the same result (functions such as summarize() and mean() are classified as Data Cleaning rather than Exploratory):

library(tidyverse)
data(mtcars)

mtcars %>% summarize(mean(hp, na.rm = TRUE))
mtcars %>% group_by(cyl) %>% summarize(mean(wt, na.rm = TRUE))

Does the package classify all dplyr functions as Data Cleaning? Is there any way we can remedy this? Thank you.
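
One possible workaround, sketched here under assumptions, is to apply your own overrides to a lookup table before joining it to the unnested calls. The defaults table below is a hypothetical stand-in with made-up rows, not the actual contents returned by get_classifications().

```r
# Hypothetical lookup table: one classification per function.
defaults <- data.frame(
  func = c("summarize", "mean", "group_by", "library"),
  classification = c("data cleaning", "data cleaning",
                     "data cleaning", "setup"),
  stringsAsFactors = FALSE
)

# Overrides chosen by the analyst (an assumption for illustration).
overrides <- c(summarize = "exploratory", mean = "exploratory")

# Replace the classification only for functions named in overrides.
custom <- defaults
hit <- custom$func %in% names(overrides)
custom$classification[hit] <- unname(overrides[custom$func[hit]])
custom
```

The resulting custom table could then be used in the inner_join() step in place of the crowdsourced classifications.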

Functional programming: symbols vs function calls

Hi,

I'm using tidycode to analyze students' code, and I wondered about something when looking at what follows:

> "purrr::map_dbl(mtcars, mean)" %>%
  dance_recital() %>%
  unnest_calls(expr)
# A tibble: 1 x 7
  value      error  output    warnings  messages  func    args      
  <list>     <list> <list>    <list>    <list>    <chr>   <list>    
1 <dbl [11]> <NULL> <chr [1]> <chr [0]> <chr [0]> map_dbl <list [2]>

I guess that the behavior above (i.e., the call to mean is not detected) is closely related to the fact that getParseData(parse(text = "map_dbl(mtcars, mean)")) detects mean as a SYMBOL.

The annoying thing is that, by using functionals, students can "hide" function calls. For instance, if I tell them to create a my_factorial function that does not call R's factorial but rather computes the factorial recursively, they can "cheat" and simply do my_factorial <- function(x) purrr::map_dbl(x, factorial).

> body(my_factorial) %>% deparse() %>% dance_recital() %>% unnest_calls(expr)
# A tibble: 3 x 7
  value  error      output warnings messages func    args      
  <list> <list>     <list> <list>   <list>   <chr>   <list>    
1 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   ::      <list [2]>
2 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   purrr   <list [2]>
3 <NULL> <smplErrr> <NULL> <NULL>   <NULL>   map_dbl <list [2]>

Right now, I prevent this by brute-forcing the code analysis (i.e., I use stringr::str_detect), but I find this solution somewhat unpleasant...

> body(my_factorial) %>% deparse() %>% stringr::str_detect("factorial")
[1] TRUE

Any idea?

@jtleek : I guess that students could also hide p-hacking from you this way :)

PS: Up to yesterday, I didn't know about tidycode and matahari, those tools are pretty cool!
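
As a hedged alternative to string matching, base R's all.names() lists every symbol in an expression, including functions that are only passed as arguments. This is a sketch of the idea, not something tidycode does; defining my_factorial does not require purrr to be installed because the body is never evaluated.

```r
# The "cheating" definition from the example above.
my_factorial <- function(x) purrr::map_dbl(x, factorial)

# all.names() walks the parsed body and returns every symbol it contains,
# both call heads (map_dbl) and plain arguments (factorial).
symbols <- all.names(body(my_factorial))

"factorial" %in% symbols
#> [1] TRUE
```

Because the comparison is on whole symbols, a name such as factorial_helper would not match, unlike a substring search. The trade-off is that a variable that merely shares the name factorial would also be flagged.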

is_model needs to allow for the function to have been called inside another function

Right now, I check is_model() based on the class of the value object -- this means a value has to be obtained, which wouldn't happen in a function 😢, so this needs to be fixed.
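
One direction this could take, sketched under assumptions, is a static check on the unevaluated expression so that no value is needed. The list model_funcs and the helper has_model_call() are hypothetical and not part of tidycode.

```r
# Hypothetical list of modelling functions to look for.
model_funcs <- c("lm", "glm", "aov")

# TRUE if any symbol in the expression matches a known modelling function.
has_model_call <- function(expr, funcs = model_funcs) {
  any(all.names(expr, functions = TRUE) %in% funcs)
}

# The model is fitted inside a function, so no fitted value exists yet.
fit_it <- function(d) lm(mpg ~ wt, data = d)

has_model_call(body(fit_it))
#> [1] TRUE
has_model_call(quote(mean(x)))
#> [1] FALSE
```

This only sees names that appear literally in the code, so a model fitted through do.call("lm", ...) or a user-defined wrapper would be missed.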

Add code to process scripts for classification

Meaning of the "score" column

Hi Dr. McGowan,

I'm trying to understand the score column of the classification_tbl.csv file, but I couldn't find any documentation of what the variable means or what role it plays. I'd really appreciate it if you could explain this variable or point me to where I can find information on it. Thank you.
