
tplyr's Introduction

Tplyr

R build status Lifecycle: stable

Welcome to Tplyr! Tplyr is a traceability-minded grammar of data format and summary. It’s designed to simplify the creation of common clinical summaries and help you focus on how you present your data rather than on the redundant summaries being performed. Furthermore, for every result Tplyr produces, it also produces the metadata necessary to give you traceability from source to summary.

As always, we welcome your feedback. If you spot a bug, would like to see a new feature, or if any documentation is unclear - submit an issue through GitHub right here.

Take a look at the cheatsheet!

Installation

You can install Tplyr with:

# Install from CRAN:
install.packages("Tplyr")

# Or install the development version:
devtools::install_github("https://github.com/atorus-research/Tplyr.git", ref="devel")

What is Tplyr?

dplyr from tidyverse is a grammar of data manipulation. So what does that allow you to do? It gives you, as a data analyst, the capability to easily and intuitively approach the problem of manipulating your data into an analysis ready form. dplyr conceptually breaks things down into verbs that allow you to focus on what you want to do more than how you have to do it.

Tplyr is designed around a similar concept, but its focus is on building summary tables common within the clinical world. In the pharmaceutical industry, a great deal of the data presented in the outputs we create are very similar. For the most part, most of these tables can be broken down into a few categories:

  • Counting for event based variables or categories
  • Shifting, which is just counting a change in state with a ‘from’ and a ‘to’
  • Generating descriptive statistics around some continuous variable.
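
To make the second category concrete, here is a minimal base R sketch (not Tplyr code) of what a shift summary boils down to: a two-way count of a 'from' state against a 'to' state. The `from` and `to` vectors are made-up reference range flags used only for illustration.

```r
# Hypothetical baseline ("from") and post-baseline ("to") range flags
lvls <- c("L", "N", "H")
from <- factor(c("N", "N", "H", "L", "N", "H"), levels = lvls)
to   <- factor(c("N", "H", "H", "N", "N", "N"), levels = lvls)

# A shift table is a cross-tabulation of the 'from' state by the 'to' state
shift_counts <- table(from = from, to = to)
print(shift_counts)
```

Each cell counts the records that moved from the row state to the column state, which is the same counting used for any other categorical variable.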

For many of the tables that go into a clinical submission, the tables are made up of a combination of these approaches. Consider a demographics table - and let’s use an example from the PHUSE project Standard Analyses & Code Sharing - Analyses & Displays Associated with Demographics, Disposition, and Medications in Phase 2-4 Clinical Trials and Integrated Summary Documents.

When you look at this table, you can begin breaking this output down into smaller, redundant, components. These components can be viewed as ‘layers’, and the table as a whole is constructed by stacking the layers. The boxes in the image above represent how you can begin to conceptualize this.

  • First we have Sex, which is made up of n (%) counts.
  • Next we have Age as a continuous variable, where we have a number of descriptive statistics, including n, mean, standard deviation, median, quartile 1, quartile 3, min, max, and missing values.
  • After that we have age, but broken into categories - so this is once again n (%) values.
  • Race - more counting,
  • Ethnicity - more counting
  • Weight - and we’re back to descriptive statistics.

So we have one table, with 6 summaries (7 including the next page, not shown) - but only 2 different approaches to summaries being performed. In the same way that dplyr is a grammar of data manipulation, Tplyr aims to be a grammar of data summary. The goal of Tplyr is to allow you to program a summary table like you see it on the page, by breaking a larger problem into smaller ‘layers’, and combining them together like you see on the page.

Enough talking - let’s see some code. In these examples, we will be using data from the PHUSE Test Data Factory based on the original pilot project submission package. We’ve packaged some subsets of that data into Tplyr, which you can use to replicate our examples and run our vignette code yourself. Note: You can see our replication of the CDISC pilot using the PHUSE Test Data Factory data here.

tplyr_table(tplyr_adsl, TRT01P, where = SAFFL == "Y") %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)")
  ) %>% 
  add_layer(
    group_count(AGEGR1, by = "Age Categories n (%)")
  ) %>% 
  build() %>% 
  kable()

| row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
|---|---|---|---|---|---|---|---|
| Age (years) | n | 86 | 84 | 84 | 1 | 1 | 1 |
| Age (years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) | 1 | 1 | 2 |
| Age (years) | Median | 76.0 | 76.0 | 77.5 | 1 | 1 | 3 |
| Age (years) | Q1, Q3 | 69.2, 81.8 | 70.8, 80.0 | 71.0, 82.0 | 1 | 1 | 4 |
| Age (years) | Min, Max | 52, 89 | 56, 88 | 51, 88 | 1 | 1 | 5 |
| Age (years) | Missing | 0 | 0 | 0 | 1 | 1 | 6 |
| Age Categories n (%) | <65 | 14 ( 16.3%) | 11 ( 13.1%) | 8 ( 9.5%) | 2 | 1 | 1 |
| Age Categories n (%) | >80 | 30 ( 34.9%) | 18 ( 21.4%) | 29 ( 34.5%) | 2 | 1 | 2 |
| Age Categories n (%) | 65-80 | 42 ( 48.8%) | 55 ( 65.5%) | 47 ( 56.0%) | 2 | 1 | 3 |

Tplyr is Qualified

We understand how important documentation and testing are within the pharmaceutical world. This is why, outside of unit testing, Tplyr includes an entire user-acceptance testing document, where requirements were established, test cases were written, and tests were independently programmed and executed. We do this in the hope that you can leverage our work within a qualified programming environment, and that we save you a substantial amount of trouble in getting it there.

You can find the qualification document within this repository right here. The ‘uat’ folder additionally contains all of the raw files, programmatic tests, specifications, and test cases necessary to create this report.

The TL;DR

Here are some of the high level benefits of using Tplyr:

  • Easy construction of table data using an intuitive syntax
  • Smart string formatting for your numbers that’s easily specified by the user
  • A great deal of flexibility in what is performed and how it’s presented, without specifying hundreds of parameters

Where to go from here?

There’s quite a bit more to learn! And we’ve prepared a number of other vignettes to help you get what you need out of Tplyr.

  • The best place to start is with our Getting Started vignette at vignette("Tplyr")
  • Learn more about table level settings in vignette("table")
  • Learn more about descriptive statistics layers in vignette("desc")
  • Learn more about count layers in vignette("count")
  • Learn more about shift layers in vignette("shift")
  • Learn more about percentages in vignette("denom")
  • Learn more about calculating risk differences in vignette("riskdiff")
  • Learn more about sorting Tplyr tables in vignette("sort")
  • Learn more about using Tplyr options in vignette("options")
  • And finally, learn more about producing and outputting styled tables using Tplyr in vignette("styled-table")

In Tplyr version 1.0.0, we’ve packed in a number of new features. For deeper dives on the largest new additions:

  • Learn about Tplyr’s traceability metadata in vignette("metadata") and about how it can be extended in vignette("custom-metadata")
  • Learn about layer templates in vignette("layer_templates")

References

In building Tplyr, we needed resources beyond our personal experience to help guide design. PHUSE has done some great work to create guidance for standard outputs with collaboration between multiple pharmaceutical companies and the FDA. You can find some of the resources that we referenced below.

Analysis and Displays Associated with Adverse Events

Analyses and Displays Associated with Demographics, Disposition, and Medications

Analyses and Displays Associated with Measures of Central Tendency

tplyr's People

Contributors

asbates, davisvaughan, elimillera, mattroumaya, mstackhouse, sadchla-codes, shiyuc

tplyr's Issues

Multiple missing counts may not respect order variable assignments

Prerequisites

For more information, see the CONTRIBUTING guide.

Description

This might be an unlikely case, but it's an assumed capability. Tplyr allows you to set multiple missing count variables. If multiple values are assigned, order layer values may not be assigned properly in a nested layer.

This might be a design flaw in set_missing_count(). Only one sort value is assigned, but multiple 'missing' placeholders are allowed.

Steps to Reproduce (Bug Report Only)

adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))

t <- tplyr_table(adsl, TRT01A) %>%
  add_layer(
    group_count(vars(EOSSTT, DCDECOD)) %>%
      set_order_count_method(c("byfactor", "bycount")) %>% 
      add_total_row(f_str('xx', n), count_missing=FALSE) %>% 
      set_missing_count(f_str("xxx", n), sort_value=Inf, denom_ignore=TRUE, Missing = NA, Blank = '')
  ) 

t %>% 
  build() %>% 
  arrange(ord_layer_1, desc(ord_layer_2)) 

# A tibble: 14 × 8
   row_label1   row_label2                       var1_Placebo  `var1_Xanomeline High Dose` `var1_Xanomeline Low Dose` ord_layer_index ord_layer_1 ord_layer_2
   <chr>        <chr>                            <chr>         <chr>                       <chr>                                <int>       <dbl>       <dbl>
 1 Blank        "Blank"                          " 0 (  0.0%)" " 0 (  0.0%)"               " 0 (  0.0%)"                            1           1         Inf
 2 COMPLETED    "COMPLETED"                      "58 ( 67.4%)" "27 ( 32.1%)"               "25 ( 29.8%)"                            1           2         Inf
 3 COMPLETED    "   COMPLETED"                   "58 ( 67.4%)" "27 ( 32.1%)"               "25 ( 29.8%)"                            1           2          58
 4 DISCONTINUED "DISCONTINUED"                   "28 ( 32.6%)" "57 ( 67.9%)"               "59 ( 70.2%)"                            1           3         Inf
 5 DISCONTINUED "   WITHDRAWAL BY SUBJECT"       " 9 ( 10.5%)" " 8 (  9.5%)"               "10 ( 11.9%)"                            1           3           9
 6 DISCONTINUED "   ADVERSE EVENT"               " 8 (  9.3%)" "40 ( 47.6%)"               "44 ( 52.4%)"                            1           3           8
 7 DISCONTINUED "   LACK OF EFFICACY"            " 3 (  3.5%)" " 1 (  1.2%)"               " 0 (  0.0%)"                            1           3           3
 8 DISCONTINUED "   DEATH"                       " 2 (  2.3%)" " 0 (  0.0%)"               " 1 (  1.2%)"                            1           3           2
 9 DISCONTINUED "   PROTOCOL VIOLATION"          " 2 (  2.3%)" " 3 (  3.6%)"               " 1 (  1.2%)"                            1           3           2
10 DISCONTINUED "   STUDY TERMINATED BY SPONSOR" " 2 (  2.3%)" " 3 (  3.6%)"               " 2 (  2.4%)"                            1           3           2
11 DISCONTINUED "   LOST TO FOLLOW-UP"           " 1 (  1.2%)" " 0 (  0.0%)"               " 1 (  1.2%)"                            1           3           1
12 DISCONTINUED "   PHYSICIAN DECISION"          " 1 (  1.2%)" " 2 (  2.4%)"               " 0 (  0.0%)"                            1           3           1
13 Missing      "Missing"                        " 0 (  0.0%)" " 0 (  0.0%)"               " 0 (  0.0%)"                            1           4         Inf
14 Total        "Total"                          "86 (100.0%)" "84 (100.0%)"               "84 (100.0%)"                            1           5         Inf

Expected behavior:

The 'Blank' row should be sorted to the bottom, alongside 'Missing' and 'Total'.

Actual behavior:

'Blank' gets a value of 1 for ord_layer_1, which is unexpected.

Note: This was found on branch gh_issue_32 which is branched off of devel.

add_total_row fails on nested count layer

Prerequisites

For more information, see the CONTRIBUTING guide.

Description

add_total_row fails on nested count layer

Steps to Reproduce (Bug Report Only)

library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.0.5
library(Tplyr)

adae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adae.xpt"))
tplyr_table(adae, TRTA) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      add_total_row(f_str("xxx", n), sort_value=-Inf)
  ) %>% 
  build()
#> Error:
#> ! Assigned data `1:nrow(group_data[-1, ])` must be compatible with row subscript `-1`.
#> x 0 rows must be assigned.
#> x Assigned data has 2 rows.
#> ℹ Only vectors of size 1 are recycled.

Created on 2022-03-27 by the reprex package (v0.3.0)

Expected behavior:

This shouldn't error, but I'm thinking that we should disallow total rows on nested count layers. This is because:

  • Within a nested layer, the total row is the parent group
  • The control of this can be rather intricate, and without more specific examples it would make more sense to create a separate layer.

Actual behavior:

Error occurs within internal method add_data_order_nested()

Versions

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Tplyr_0.4.4    shiny_1.5.0    testthat_3.1.2 dplyr_1.0.7   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8        tidyr_1.2.0       prettyunits_1.1.1 ps_1.6.0          assertthat_0.2.1 
 [6] rprojroot_2.0.2   digest_0.6.29     utf8_1.2.2        mime_0.12         R6_2.5.1         
[11] reprex_0.3.0      evaluate_0.14     pillar_1.7.0      rlang_1.0.1       rstudioapi_0.13  
[16] jquerylib_0.1.4   whisker_0.4       callr_3.7.0       blob_1.2.1        rmarkdown_2.10   
[21] desc_1.4.0        devtools_2.4.3    stringr_1.4.0     compiler_4.0.2    httpuv_1.5.4     
[26] xfun_0.28         pkgconfig_2.0.3   pkgbuild_1.2.1    clipr_0.8.0       htmltools_0.5.2  
[31] tidyselect_1.1.2  tibble_3.1.6      fansi_1.0.2       crayon_1.5.0      withr_2.4.3      
[36] later_1.1.0.1     brio_1.1.3        jsonlite_1.7.2    xtable_1.8-4      lifecycle_1.0.1  
[41] DBI_1.1.0         magrittr_2.0.2    cli_3.2.0         stringi_1.7.6     cachem_1.0.6     
[46] fs_1.5.1          promises_1.1.1    remotes_2.4.2     bslib_0.2.4       ellipsis_0.3.2   
[51] generics_0.1.2    vctrs_0.3.8       tools_4.0.2       forcats_0.5.1     glue_1.6.1       
[56] purrr_0.3.4       processx_3.5.2    pkgload_1.2.4     fastmap_1.1.0     sessioninfo_1.2.2
[61] memoise_2.0.1     knitr_1.36        sass_0.4.0.9000   usethis_2.1.3 

Note - this is based off the devel branch.

Generally improve handling of "empty"

String formatting has a lot of opportunities for improvement - but here are some general ideas:

  • It would be helpful for complete 0 counts to accept the .overall argument in empty from an f_str. Count layers 0 fill rather than producing NAs, so "empty" in that sense could be considered all 0 values
  • There's no threshold handling at the moment. Allowing case_when syntax on an f_str thrown into an argument would offer flexibility. Some scenarios could be generally challenging though. For example - if the desired format is "xx (<1%)".
    • An idea here could be thresholds for individual variables, and then an overall threshold as separate parameters, and overall takes precedence.

Descriptive stats as columns

Some tables require that the descriptive stats are columns instead of rows. Essentially, this should be a pivot of the treatment variable with the descriptive stats.

The default layout of a desc layer might look as follows:

adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))

tplyr_table(adsl, TRT01P) %>% 
  add_layer(
    group_desc(AGE)
  ) %>% 
  build()

# A tibble: 6 x 6
  row_label1 var1_Placebo   `var1_Xanomeline High Dose` `var1_Xanomeline Low Dose` ord_layer_index ord_layer_1
  <chr>      <chr>          <chr>                       <chr>                                <int>       <int>
1 n          " 86"          " 84"                       " 84"                                    1           1
2 Mean (SD)  "75.2 ( 8.59)" "74.4 ( 7.89)"              "75.7 ( 8.29)"                           1           2
3 Median     "76.0"         "76.0"                      "77.5"                                   1           3
4 Q1, Q3     "69.2, 81.8"   "70.8, 80.0"                "71.0, 82.0"                             1           4
5 Min, Max   "52, 89"       "56, 88"                    "51, 88"                                 1           5
6 Missing    "  0"          "  0"                       "  0"                                    1           6

The idea here would be to flip this presentation, so that the treatment variables are in rows and the stats are in columns:

dat <- tplyr_table(adsl, TRT01P) %>% 
  add_layer(
    group_desc(AGE)
  ) %>% 
  build()

dat %>% 
  pivot_longer(
    c('var1_Placebo', 'var1_Xanomeline High Dose', 'var1_Xanomeline Low Dose'),
    names_to = 'treat_var',
    values_to = 'var1'
    ) %>% 
  pivot_wider(
    id_cols = treat_var,
    names_from = 'row_label1',
    values_from = 'var1'
  )

# A tibble: 3 x 7
  treat_var                 n     `Mean (SD)`  Median `Q1, Q3`   `Min, Max` Missing
  <chr>                     <chr> <chr>        <chr>  <chr>      <chr>      <chr>  
1 var1_Placebo              " 86" 75.2 ( 8.59) 76.0   69.2, 81.8 52, 89     "  0"  
2 var1_Xanomeline High Dose " 84" 74.4 ( 7.89) 76.0   70.8, 80.0 56, 88     "  0"  
3 var1_Xanomeline Low Dose  " 84" 75.7 ( 8.29) 77.5   71.0, 82.0 51, 88     "  0"  

Calculate missing row based on subjects not in target data that are in pop_data

adsl <-
  tribble(~ USUBJID, ~TRT,
          "001", "TRTA",
          "002", "TRTB")

adae <-
  tribble(~ USUBJID, ~TRT, ~ AEBODSYS,
          "001", "TRTA", "An AE")

t <- tplyr_table(adae, TRT) %>%
  set_pop_data(adsl) %>%
  add_layer(
    group_count(AEBODSYS) %>%
      set_missing_count(f_str("xx", n), missing = NA)
  )

build(t)

Should result in:

# A tibble: 2 × 4
  row_label1 var1_TRTA    ord_layer_index ord_layer_1
  <chr>      <chr>                  <int>       <dbl>
1 An AE      "1 (100.0%)"               1           1
2 missing    " 1"                       1           2
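
As a rough illustration of the calculation being requested, the following base R sketch (not Tplyr code) finds subjects who are in the population data but have no record in the target data and counts them by treatment. The variable names `no_event` and `missing_by_trt` are invented for this example; how Tplyr would present the result is not assumed here.

```r
# Same toy data as above, built with base R data frames
adsl <- data.frame(USUBJID = c("001", "002"),
                   TRT = c("TRTA", "TRTB"),
                   stringsAsFactors = FALSE)
adae <- data.frame(USUBJID = "001", TRT = "TRTA", AEBODSYS = "An AE",
                   stringsAsFactors = FALSE)

# Subjects in the population data with no record in the target data
no_event <- adsl[!adsl$USUBJID %in% adae$USUBJID, ]

# Count those subjects per treatment group, keeping empty groups as 0
missing_by_trt <- table(factor(no_event$TRT, levels = unique(adsl$TRT)))
print(missing_by_trt)
```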

Alternate rounding option?

Hi Eli and Mike,

I am looking at using Tplyr to validate tables created in SAS

I am not having issues comparing medians and other percentiles because I am using
options(tplyr.quantile_type = 3)

But sometimes the data generated by the group_count() layer
set_count_layer_formats(n_counts = f_str("xx (xxx%)", n, pct))
is off by a digit. And Proc Compare throws it out, causing it to "fail" validation.

I had been reading this blog by Ali Dootson
TFL programming in R versus SAS

And found a function they wrote that emulates SAS rounding.

ut_round <- function(x, n=0)
{
 # x is the value to be rounded
 # n is the precision of the rounding
 scale <- 10^n
 y <- trunc(x * scale + sign(x) * 0.5) / scale
 # Return the rounded number
 return(y)
}

I have been using this to post-process the Tplyr tibble created and update the nnn (nnn%) strings
so I can get Proc Compare to accept the results.

Any chance you could come up with an alternate rounding option?
Perhaps something like
options(tplyr.rounding = {default|SAS})?

Right now I am only seeing the issue with percentages generated in group_count()

Thanks so much!
Robert
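
To show where the one-digit discrepancies come from, here is a small base R comparison of `round()` against the `ut_round()` helper quoted above. Base R rounds exact halves to the nearest even digit, while the helper rounds halves away from zero. The `pct` values are made up for illustration, and the helper is repeated only so the snippet is self-contained.

```r
# Round-half-away-from-zero helper, as quoted in the issue
ut_round <- function(x, n = 0) {
  scale <- 10^n
  trunc(x * scale + sign(x) * 0.5) / scale
}

# Hypothetical percentages that sit exactly on a half
pct <- c(0.5, 2.5, 12.5)

half_even <- round(pct)     # base R: ties go to the even digit
half_away <- ut_round(pct)  # helper: ties go away from zero

print(data.frame(pct, half_even, half_away))
```

Values that are not exactly representable in binary floating point (many decimals with fractional digits) can still round differently from what a decimal calculation would suggest, so treat this as a sketch of the tie-breaking difference rather than a complete emulation of another system's rounding.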

devel branch not supporting denom/total as formatable value for count layers

Prerequisites

For more information, see the CONTRIBUTING guide.

Description

The current dev branch isn't respecting denom or total as a valid name within an f_str() in set_format_strings()

Steps to Reproduce (Bug Report Only)

t <- tplyr_table(mtcars, cyl) %>% 
  add_layer(
     group_count(gear) %>% 
        set_format_strings('xx (xx.x) xx', n, pct, denom)
   ) 
# Error: In `set_format_string` entry 1 is not an `f_str` object. All assignmentes made within `set_format_string` must be made using the function `f_str`. See the `f_str` documentation.

Expected behavior:

If you use get_numeric_data(), the denom is available as total. Count layers need a valid name that can be used to add the denominator to the formatted string.

Actual behavior:

Error above

Versions

This is using the current devel branch

Sorting AE table by descending count when the counts are not unique

Hi folks,
I am going through this example of an AE table using the CDISC adsl and adae data.
I have limited the number of SOC (AEBODSYS) terms to 5 here.
When I run this code, I notice that the last two AEBODSYS terms and AEDECODs are piled on top of each other.

I came up with a workaround, but I was wondering if there was a proper way of doing this in Tplyr that I overlooked?
Thanks!
Robert

library(dplyr)
library(Tplyr)

data(adsl) # sample data from CDISC Pilot
data(adae) # sample data from CDISC Pilot

adsl2 <- adsl %>% 
  filter(SAFFL == "Y")

soc_vec <- c("SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
             "NERVOUS SYSTEM DISORDERS",
             "HEPATOBILIARY DISORDERS",
             "IMMUNE SYSTEM DISORDERS",
             "SOCIAL CIRCUMSTANCES"
             )
adae2 <- adae %>% 
  filter(AEBODSYS %in% soc_vec) %>% 
  arrange(AEBODSYS, AEDECOD) %>% 
  mutate(SOC = AEBODSYS, PT = AEDECOD)

adae_tp <- tplyr_table(adae2, TRTA) %>%
  set_pop_data(adsl2) %>%
  set_pop_treat_var(TRT01A) %>%
  set_pop_where(TRUE) %>%
  
  add_total_group(group_name = "XTotal") %>% 
  add_layer(
    group_count('Number of subjects with any event') %>% 
      set_distinct_by(USUBJID) %>% 
      set_denoms_by(TRTA) %>% 
      set_format_strings(f_str("xxx ( xx.x)", distinct_n, distinct_pct))
  ) %>% 
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>% 
      set_nest_count(TRUE) %>% 
      set_indentation("..") %>%
      set_format_strings(f_str("xx (xx.x%)", distinct_n, distinct_pct)) %>% 
      set_distinct_by(USUBJID) %>%
      set_order_count_method("bycount") %>%
      set_ordering_cols("XTotal") %>% 
      set_result_order_var(distinct_n)
  ) 

adae_df <- adae_tp %>% 
  build() %>% 
  arrange(ord_layer_index, desc(ord_layer_1), desc(ord_layer_2))

image

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.2 Tplyr_0.4.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        compiler_4.0.2    pillar_1.4.6      forcats_0.5.0    
 [5] prettyunits_1.1.1 remotes_2.2.0     tools_4.0.2       testthat_2.3.2   
 [9] digest_0.6.25     pkgbuild_1.1.0    pkgload_1.1.0     memoise_1.1.0    
[13] lifecycle_0.2.0   tibble_3.0.3      pkgconfig_2.0.3   rlang_0.4.7      
[17] cli_2.0.2         rstudioapi_0.11   curl_4.3          yaml_2.2.1       
[21] stringr_1.4.0     withr_2.2.0       hms_0.5.3         desc_1.2.0       
[25] generics_0.0.2    fs_1.5.0          vctrs_0.3.4       devtools_2.3.1   
[29] rprojroot_1.3-2   tidyselect_1.1.0  glue_1.4.2        R6_2.4.1         
[33] processx_3.4.4    fansi_0.4.1       sessioninfo_1.1.1 readr_1.3.1      
[37] tidyr_1.1.2       purrr_0.3.4       callr_3.4.3       magrittr_1.5     
[41] backports_1.1.9   ps_1.3.4          ellipsis_0.3.1    usethis_1.6.1    
[45] assertthat_0.2.1  stringi_1.4.6     crayon_1.3.4

Update documentation for `set_indentation`

Update count vignette where it says:
The default indentation used will be 3 spaces, but as you can see here - you can set the indentation however you like using set_indentation().
Update To:
The default indentation used will be 3 spaces, but as you can see here - you can set the indentation however you like using set_indentation().

Also update the default for set_indentation to "   " (three spaces) to make it clear what the default is

Bug when POSIXct variable present

Hi,

Thanks for the great work on Tplyr! My team and I have been using it for some time. Recently, we noticed strange behavior whenever POSIXct class variables exist in the "pop data" below. I've created a minimal reproducible example using the CDISC pilot data. The issue goes away if you get rid of the POSIXct variable or if you convert it to character, which is not an ideal workaround. Could you take a look, or let me know if I'm doing something wrong?

> library(dplyr)
> library(Tplyr)
> 
> # use as needed
> # setwd("path/to/files")
> 
> cdisc_adsl <- haven::read_xpt("data-raw/adsl.xpt")
> cdisc_adae <- haven::read_xpt("data-raw/adae.xpt")
> 
> # Add in POSIXct variable
> adsl2 <- cdisc_adsl %>%
+   mutate(fake_dttm = as.POSIXct("2019-01-01 10:10:10"), origin = "1970-01-01")
> 
> str(adsl2$fake_dttm) 
POSIXct[1:254], format: "2019-01-01 10:10:10" "2019-01-01 10:10:10" "2019-01-01 10:10:10" "2019-01-01 10:10:10" ...
>
> # Make sure TRT01P exists in ADAE
> adae2 <- cdisc_adae %>%
+   left_join(adsl2 %>% select(USUBJID, TRT01P), "USUBJID")
> 
> # Create table
> tp_obj <- Tplyr::tplyr_table(adae2, TRT01P) %>% 
+   Tplyr::set_pop_data(adsl2) %>%
+   Tplyr::add_layer(
+     group_count('Number of subjects with any event') %>% 
+       Tplyr::set_distinct_by(USUBJID) %>% 
+       Tplyr::set_denoms_by(TRT01P)
+   )  
> 
> tp_obj %>% Tplyr::build() # error 
Error in as.POSIXlt.character(x, tz, ...) : 
  character string is not in a standard unambiguous format

And my sessionInfo() is below. Notice I ran this on linux OS but my team has encountered the same issue on windows. Thanks, let me know if I can provide any other info!

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Tplyr_0.4.2       rstudioapi_0.13   dplyr_1.0.3       readxl_1.3.1     
[5] haven_2.3.1       r2rtf_0.3.1      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        tidyr_1.1.2       prettyunits_1.1.1 ps_1.5.0         
 [5] assertthat_0.2.1  rprojroot_2.0.2   digest_0.6.27     mime_0.9         
 [9] plyr_1.8.6        cellranger_1.1.0  R6_2.5.0          ggplot2_3.3.3    
[13] pillar_1.4.7      rlang_0.4.10      callr_3.5.1       config_0.3.1     
[17] desc_1.2.0        stringr_1.4.0     munsell_0.5.0     tinytex_0.29     
[21] shiny_1.6.0       compiler_4.0.3    httpuv_1.5.5      xfun_0.20        
[25] pkgconfig_2.0.3   pkgbuild_1.2.0    htmltools_0.5.1.1 insight_0.14.5   
[29] tidyselect_1.1.0  tibble_3.0.5      roxygen2_7.1.1    attempt_0.3.1    
[33] fansi_0.4.2       crayon_1.3.4      withr_2.4.1       later_1.1.0.1    
[37] grid_4.0.3        jsonlite_1.7.2    xtable_1.8-4      gtable_0.3.0     
[41] lifecycle_0.2.0   DBI_1.1.1         dockerfiler_0.1.4 magrittr_2.0.1   
[45] scales_1.1.1      cli_2.2.0         stringi_1.5.3     fs_1.5.0         
[49] promises_1.1.1    remotes_2.2.0     testthat_3.0.1    xml2_1.3.2       
[53] ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6       sjlabelled_1.1.8 
[57] tools_4.0.3       forcats_0.5.1     golem_0.3.1       glue_1.4.2       
[61] purrr_0.3.4       hms_1.0.0         processx_3.4.5    pkgload_1.1.0    
[65] fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-0  gt_0.2.2         
[69] knitr_1.31        usethis_2.0.0    

Infinities produced in min/max when no obs variable when using multiple targets

If multiple target variables are used in group_desc(), and all observations for the target variable are null, Inf and -Inf may be presented for min and max values.

It isn't immediately clear where this should be corrected. We don't want to mask infinite values returned by summary functions by default, and we don't want our summary functions to produce unexpected results.
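
The underlying base R behavior can be reproduced without Tplyr. In the sketch below, `min()` and `max()` with `na.rm = TRUE` on an all-missing vector return `Inf` and `-Inf` (with a warning). The `safe_min()` wrapper is a hypothetical illustration of one way a summary function could return `NA` instead; it is not part of Tplyr.

```r
# A target variable where every observation is missing
all_missing <- c(NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to summarize, so base R returns
# Inf / -Inf and emits a warning (suppressed here)
lo <- suppressWarnings(min(all_missing, na.rm = TRUE))
hi <- suppressWarnings(max(all_missing, na.rm = TRUE))
print(c(lo, hi))

# Hypothetical wrapper: return NA when no non-missing values exist
safe_min <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else min(x)
}
print(safe_min(all_missing))
```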

Add overall N to count_layer?

I have a table I am trying to produce with Tplyr
This happens to be a Phase-1 study, so N=8 in each column

image

Is there, or can there be, a way of specifying the overall N= inside the group_count() layer?

set_format_strings(f_str("xx/xx (xxx.x)", distinct_n, N, distinct_pct)) ?

Thanks,
Robert

Allow for manual control of decimal precision by parameter

DP control Rules:
0,+1,+2 [but not auto, still user supplied (vector in the source df, varies by PARAMCD)]

Example

f_str_ <- ???
t <- tplyr_table(adlb, TRTA) %>%
  add_layer(
    group_desc(AVAL, by = PARAM) %>%
      set_format_strings(
       f_str_
      )
    ) %>%
  build()
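
One way to picture the requested rule, sketched in base R outside of Tplyr: look up a user-supplied number of decimal places for each parameter and add a per-statistic offset before formatting. The `dp` lookup, the parameter codes, and the `fmt_stat()` helper are all assumptions made for this example, as is the reading of "0, +1, +2" as offsets added to the collected precision.

```r
# Hypothetical user-supplied decimal places, keyed by PARAMCD
dp <- c(ALT = 0, CREAT = 2)

# Format a statistic using the parameter's precision plus an offset
fmt_stat <- function(x, paramcd, extra = 0) {
  formatC(x, format = "f", digits = dp[[paramcd]] + extra)
}

alt_min   <- fmt_stat(23.456, "ALT", extra = 0)    # collected precision
alt_mean  <- fmt_stat(23.456, "ALT", extra = 1)    # one extra decimal
crea_mean <- fmt_stat(0.98765, "CREAT", extra = 1)
print(c(alt_min, alt_mean, crea_mean))
```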

Have separate columns for count and descriptive data

Description

Is there a way we can have separate columns for the categories of count data and the statistics of desc data when used together? Right now, both the categories and statistics get populated in the same column.

Looking for something like the table attached here:

Example of Table

image

Tplyr not respecting factor levels of treatment variables

The issue is at line 28 of prebuild.R: fct_expand is overwriting the factor levels of the original target dataset.
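
For reference, here is a base R sketch of the behavior one would expect: extra levels coming from the population data are appended, while the order already set on the target data's treatment factor is kept. The vectors are made-up stand-ins, and this does not reproduce what prebuild.R actually does.

```r
# Treatment factor on the target data with a deliberate level order
target_trt <- factor(c("Placebo", "Low"), levels = c("Placebo", "Low", "High"))

# Population data where the levels fall in default alphabetical order
pop_trt <- factor(c("High", "Low", "Placebo", "Total"))

# Keep the target's order first, then append any levels only seen in pop data
combined_levels <- union(levels(target_trt), levels(pop_trt))
target_trt2 <- factor(target_trt, levels = combined_levels)
print(levels(target_trt2))
```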

Reprex

devtools::load_all("~/Tplyr")
library(dplyr)

subjects <- safetyData::adam_adsl %>%
  mutate(TRTA = as.factor(TRT01P)) %>%
  filter(DCREASCD  == "Physician Decision")

adae_filtered <- safetyData::adam_adae %>%
  mutate(TRTA = as.factor(TRTA)) %>%
  filter(USUBJID %in% subjects$USUBJID)

t <- tplyr_table(adae_filtered, TRTA) %>%
  set_pop_data(subjects) %>%
  #set_pop_treat_var(TRT01P) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      set_format_strings(f_str('xx (xx.x%)', distinct_n, distinct_pct))
  )

t2 <- build(t)

t2 %>% select(starts_with('row'), var1_Placebo, `var1_Xanomeline Low Dose`, `var1_Xanomeline High Dose`) %>%
  add_column_headers(
    paste0(" Body System | Term | Placebo (N=**Placebo**) | Xanomeline Low Dose (N=**Xanomeline Low Dose**) ",
           "| Xanomeline High Dose (N=**Xanomeline High Dose**)"),
    header_n(t))

Add data peeking in Tplyr functions

This is the functionality that allows dplyr to know the variables in the data you're working with when calling select or filter. Documentation is in tidyselect::peek_data

Export metadata functions and allow extension of metadata dataframe

Prerequisites

For more information, see the CONTRIBUTING guide.

Description

In addition to the new metadata functionality, I'd like to enhance the following:

  • Export the tplyr_meta functions to allow users to directly construct their own tplyr_meta objects. This involves tplyr_meta(), add_names(), and add_filters()
  • Add capability to add records to the tplyr_table$metadata dataframe internal to the tplyr_table object
  • Update get_meta_result() and get_meta_subset() to S3 methods to read from a dataframe directly or a tplyr_table

External modifications to metadata would be the responsibility of the developer - so this would need to be explicit and wouldn't have the automated construction that comes out of the tplyr_table build.

Example

my_df <- tibble(
   row_id = "x_1",
   # Wrap in list() so the tplyr_meta object is stored in a list column
   var1_Placebo = list(tplyr_meta(
      quos(a, b, c),
      quos(a == 1, b == 2, c == 3)
   ))
)

t <- append_metadata(t, my_df)
# Pull out the metadata dataframe - you can then modify and expand
# as desired and manipulate however you want
meta <- get_metadata(t)

# Still relying on row_id as the index, point to result cells - and this essentially
# skips the step of extracting the metadata object from a tplyr_table
get_meta_result(meta, 'c1_1', 'var1_Placebo')
get_meta_subset(meta, 'c1_1', 'var1_Placebo')

This should be extended into #55

Metadata dataframe build

This feature would return metadata associated with each layer in a usable format, so that after the dataframe is built, the metadata may accompany the Tplyr table in a usable way.

`apply_row_masks()` doesn't respect descending sorting of row break variables

Description

apply_row_masks() is intended to be used post-sorting, as it blanks row labels out assuming that the repetitive values are already next to each other. Currently, if using additional break variables to insert row breaks, this fails to consider if one of the break variables has been sorted in descending order, leading to an unexpected result.
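
To see why the sort order matters for this kind of masking, here is a minimal base R sketch. The helper `mask_repeats()` is hypothetical and is not Tplyr's implementation; it only illustrates that a label is blanked when it equals the label in the row directly above it, so the result depends entirely on how the rows are ordered at that moment.

```r
# Hypothetical helper (not part of Tplyr): blank a label when it repeats
# the label in the row directly above it.
mask_repeats <- function(x) {
  x <- as.character(x)
  if (length(x) < 2) return(x)
  # TRUE where an element equals its immediate predecessor
  repeated <- c(FALSE, x[-1] == x[-length(x)])
  x[repeated] <- ""
  x
}

sorted   <- c("A", "A", "B", "B")
unsorted <- c("A", "B", "A", "B")

print(mask_repeats(sorted))    # "A" ""  "B" ""
print(mask_repeats(unsorted))  # "A" "B" "A" "B" (no adjacent repeats, nothing masked)
```

Because the masking is purely positional, any re-sorting that happens after it (or inside it) changes which labels end up blank.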

Steps to Reproduce (Bug Report Only)

adae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adae.xpt")) %>% 
  filter(AEBODSYS %in% c("IMMUNE SYSTEM DISORDERS", "CONGENITAL, FAMILIAL AND GENETIC DISORDERS"))

# Create the Tplyr table object - this is like a specification for the table
t <- tplyr_table(adae, TRTA) %>% 
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>% 
      set_distinct_by(USUBJID) %>% 
      set_order_count_method("bycount", break_ties='desc') %>%
      set_ordering_cols("Xanomeline High Dose") %>%
      set_result_order_var(distinct_n)
  )

# Now build the table - this is where the number crunching happens
ae1 <- t %>% 
  build() %>% 
  arrange(desc(ord_layer_1), desc(ord_layer_2)) %>% 
  apply_row_masks(row_breaks=TRUE, ord_layer_1)

Expected behavior:

CONGENITAL, FAMILIAL AND GENETIC DISORDERS should sort first, and IMMUNE SYSTEM DISORDERS should sort second.

Actual behavior:

IMMUNE SYSTEM DISORDERS sorts first and CONGENITAL, FAMILIAL AND GENETIC DISORDERS sorts second. Ascending sorting is reimplemented for ord_layer_1.

Workaround

The desired sort order can be achieved by re-sorting the data with the newly added ord_break variable.

ae1 %>% 
  arrange(desc(ord_layer_1), desc(ord_layer_2)) %>% 
  apply_row_masks(row_breaks = TRUE, ord_layer_index, ord_layer_1) %>% 
  arrange(desc(ord_layer_1), ord_break, desc(ord_layer_2))

This is far from desirable, but it works.

Tplyr and rlang 1.0.0

Hello, I see in the revdep checks for the next version of rlang:

`Tplyr:::modify_nested_call(mean(c(1, 2, 3)))` threw an error with unexpected message.
Expected match: "`call` must be a quoted call"
Actual message: "`call` must be a defused call, not a number."

This is because you are testing for the contents of an error message generated in rlang:

test_that("Call must be quoted", {
  expect_error(Tplyr:::modify_nested_call(mean(c(1,2,3))), "`call` must be a quoted call")
  c <- quo(tplyr_table(treat_var = Species))
  expect_silent(Tplyr:::modify_nested_call(c))
})

If you must monitor foreign errors, could you please do it in snapshots? See https://testthat.r-lib.org/articles/snapshotting.html

The great advantage of snapshots is that they only fail locally or in GitHub Actions. This allows you to monitor the appearance of errors over time without causing check failures when things change upstream.

Flag potential improper use of by variables

Description

When by variables are provided to Tplyr, a complete() call is run to fill in missing values. We've made an inherent assumption that you won't provide two by variables that map 1:1, for example VISITNUM and VISIT. This is because Tplyr can automatically detect that AVISITN should be used for sorting AVISIT, based on ADaM assumptions.

If you do provide VISITNUM and VISIT, it will duplicate all of those records and essentially cartesian join the results because of the tidyr::complete() calls we run. This is necessary to an extent, because we want to provide the 0 rows for factor combinations with no results in the data, but it makes these scenarios a bit unintuitive and confusing.

As a preventative measure, we should introduce a warning if we notice a large proportional increase in records due to the complete. Gauging what that ratio should be is a little tough, but maybe something like: if we notice that rows increase by 50%, then produce a warning.
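
A minimal base R sketch of such a check follows. Everything in it is an assumption for illustration: the function name `warn_on_by_inflation()`, its arguments, and the choice to compare observed combinations of the by variables against their full cartesian product are not part of Tplyr. The 50% threshold is the one floated above.

```r
# Hypothetical check (not part of Tplyr): compare the observed combinations
# of the by variables with the full cartesian product that completing the
# data would create, and warn when the relative increase exceeds a threshold.
warn_on_by_inflation <- function(data, by_vars, threshold = 0.5) {
  observed <- nrow(unique(data[by_vars]))
  if (observed == 0) return(invisible(0))
  completed <- prod(vapply(data[by_vars], function(x) length(unique(x)), integer(1)))
  increase <- (completed - observed) / observed
  if (increase > threshold) {
    warning(sprintf(
      "Completing %s grows %d combinations to %d; are any by variables 1:1?",
      paste(by_vars, collapse = ", "), observed, completed
    ), call. = FALSE)
  }
  invisible(increase)
}

# VISITNUM and VISIT are 1:1, so 3 observed combinations become 9
visits <- data.frame(
  VISITNUM = c(1, 2, 3),
  VISIT    = c("Baseline", "Week 2", "Week 4")
)
msg <- tryCatch(
  warn_on_by_inflation(visits, c("VISITNUM", "VISIT")),
  warning = function(w) conditionMessage(w)
)
print(msg)
```

A fully crossed pair of variables (for example, treatment by visit) yields an increase of zero and stays silent, so the check only fires when completing the data adds combinations that never occur.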

Add search bar to website

Description

I think you might need to update your version of bootstrap, but it would be helpful to have a search bar on the pkgdown

Add 'tidy' methods for Tplyr tables

When programming a table, it would be nice if calling filter dispatched to filter.tplyr_table, so that filters or 'where' clauses could be added before or after the build.

Example

tplyr_table(adsl, TRT01P) %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)")
  ) %>% 
  filter(SAFFL == "Y") %>%
  build()

Standardize numeric data return format

To support tlang, get_numeric_data() is about 90% of the way there - but the format needs to be unified and collapsed to get there. To support this, we have two options:

  • Create a build_tlang() function
  • Repurpose get_numeric_data() to return the tlang compatible output.

Native pipe compatibility

Currently layer syntax relies strictly on use of the magrittr pipe. The native pipe (|>) available in R >= 4.1.0 does not work.
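
One difference between the two pipes that may be relevant here can be shown in base R. Whether this is the actual cause in Tplyr is an assumption; the issue does not say. The native pipe is rewritten by the parser, so a captured expression contains no pipe call at all, while the magrittr pipe remains an ordinary call to `%>%`. The names `x`, `f`, and `y` are placeholders and are never evaluated.

```r
# Requires R >= 4.1.0 for the native pipe syntax.

# The native pipe is rewritten at parse time: the captured expression is
# already the nested call, with no trace of a pipe.
native_expr <- quote(x |> f(y))
print(native_expr)          # f(x, y)

# The magrittr pipe is a regular function call that survives in the
# captured expression, so code can detect and rewrite it.
magrittr_expr <- quote(x %>% f(y))
print(magrittr_expr[[1]])   # `%>%`
```

Any code that inspects an unevaluated layer expression and looks for `%>%` calls would therefore see nothing to match when the native pipe is used.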

Allow storing and usage of layer templates

Tplyr has the ability to store formatting at the table level or pulled from options, but tables can have many other settings and configurations. Another way to eliminate redundancy would be creating layer templates. The general idea would look something like:

example_template <- new_layer_template(
   group_count(...) %>%
      set_format_strings(f_str("xxx (xx.xx%)", n, pct)) %>%
      set_missing_count(f_str("xxx", n), sort_value=Inf, denom_ignore=TRUE, Missing = NA)
)

tplyr_table(adsl, TRT01P) %>%
   add_layer(
      example_template(RACE)
   ) %>%
   build()

This would allow for the generation of a layer with the exact same settings based on the template. The example_template function itself would take the same variables as the group_<type> family, but could still be extended itself using other layer modifier functions.

add_total_row has no effect for shift layers

Description

On shift layers, add_total_row() is accepted but has no effect, while set_missing_count() explicitly produces an error.

Steps to Reproduce (Bug Report Only)

Struggled with reprex for this...

adlb <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adlbc.xpt"))

t <- tplyr_table(adlb, TRTA, where=PARAMCD == "CK" & AVISIT == "Week 2") %>%
  add_layer(
    group_shift(vars(row = BNRIND, column = ANRIND), by = vars(PARAM, AVISIT)) %>%
      set_format_strings(f_str("x", n))
  ) 

t %>%
  build() 

Expected behavior:

Either an error to signify not supported on Shift layers, or a total row.

Actual behavior:

Table completes build with no messages

Versions

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_2.0.2 Tplyr_0.4.4    shiny_1.5.0    testthat_3.1.2 dplyr_1.0.7   

loaded via a namespace (and not attached):
 [1] sass_0.4.0.9000   pkgload_1.2.4     tidyr_1.2.0       jsonlite_1.7.2    bslib_0.2.4       brio_1.1.3        assertthat_0.2.1  highr_0.9        
 [9] blob_1.2.1        yaml_2.2.1        remotes_2.4.2     sessioninfo_1.2.2 pillar_1.7.0      glue_1.6.1        digest_0.6.29     promises_1.1.1   
[17] htmltools_0.5.2   httpuv_1.5.4      clipr_0.8.0       pkgconfig_2.0.3   devtools_2.4.3    haven_2.4.3.9001  purrr_0.3.4       xtable_1.8-4     
[25] processx_3.5.2    whisker_0.4       later_1.1.0.1     tzdb_0.2.0        tibble_3.1.6      generics_0.1.2    usethis_2.1.3     ellipsis_0.3.2   
[33] cachem_1.0.6      withr_2.4.3       cli_3.2.0         crayon_1.5.0      mime_0.12         memoise_2.0.1     evaluate_0.14     ps_1.6.0         
[41] fs_1.5.1          fansi_1.0.2       forcats_0.5.1     pkgbuild_1.2.1    tools_4.0.2       prettyunits_1.1.1 hms_1.1.1         lifecycle_1.0.1  
[49] stringr_1.4.0     reprex_0.3.0      callr_3.7.0       jquerylib_0.1.4   compiler_4.0.2    rlang_1.0.1       rstudioapi_0.13   rmarkdown_2.10   
[57] waldo_0.3.1       DBI_1.1.0         R6_2.5.1          knitr_1.36        fastmap_1.1.0     utf8_1.2.2        rprojroot_2.0.2   readr_2.1.2      
[65] desc_1.4.0        stringi_1.7.6     Rcpp_1.0.8        vctrs_0.3.8       tidyselect_1.1.2  xfun_0.28        

Note: This is currently from branch gh_issue_32, which is based on devel.

Incorrect denom counts when creating combined treatment group

Steps to Reproduce (Bug Report Only)

I executed the following code:

adae_table <- tplyr_table(adae, TRTA) %>%
  add_treat_grps("Treated" = c("Xanomeline High Dose", "Xanomeline Low Dose")) %>%
  add_layer(
    group_count(AEDECOD) %>%
      set_distinct_by(USUBJID) %>% 
      set_nest_count(TRUE)
  )

adae_build <- adae_table %>%
  build()

Expected behavior: The creation of a combined treatment group named "Treated", and accurately calculated summary statistics for all treatment groups.

Actual behavior: While a new treatment group named "Treated" was created, and denominator counts and percentages were correctly calculated, this is not the case for the original two treatment groups. "Xanomeline High Dose" and "Xanomeline Low Dose" denominator counts are not being correctly calculated, causing the percentages to be rendered as "(Inf%)".
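
For context on the "(Inf%)" rendering: in R, dividing a non-zero count by zero yields `Inf`, and `sprintf()` prints that as the string "Inf". The sketch below assumes the denominator for the affected groups resolves to zero; the report does not state the actual value, and the numbers are made up for illustration.

```r
n     <- 5   # hypothetical distinct count for one treatment group
denom <- 0   # assumption: the denominator was incorrectly resolved to zero

pct <- 100 * n / denom
formatted <- sprintf("(%.1f%%)", pct)
print(formatted)   # "(Inf%)"
```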

Versions

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] huxtable_5.0.0  magrittr_1.5    Tplyr_0.1.1     forcats_0.5.0   stringr_1.4.0
 [6] dplyr_1.0.0     purrr_0.3.4     readr_1.3.1     tidyr_1.1.0     tibble_3.0.3
[11] ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2   dbplyr_1.4.4
 [6] odbc_1.2.3       tools_4.0.2      bit_1.1-15.2     lubridate_1.7.9  jsonlite_1.7.0
[11] lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.7      reprex_0.3.0
[16] cli_2.0.2        DBI_1.1.0        rstudioapi_0.11  yaml_2.2.1       haven_2.3.1
[21] xfun_0.15        withr_2.2.0      xml2_1.3.2       httr_1.4.2       knitr_1.29
[26] fs_1.4.2         hms_0.5.3        generics_0.0.2   vctrs_0.3.2      bit64_0.9-7.1
[31] grid_4.0.2       tidyselect_1.1.0 glue_1.4.1       R6_2.4.1         fansi_0.4.1
[36] readxl_1.3.1     modelr_0.1.8     blob_1.2.1       backports_1.1.8  scales_1.1.1
[41] ellipsis_0.3.1   rvest_0.3.5      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6
[46] munsell_0.5.0    broom_0.7.0      crayon_1.3.4

Add `apply_formats()` function from connect the dots presentation

Function provided for post:

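# Requires Tplyr for f_str() and purrr for pmap_chr()
library(Tplyr)
library(purrr)

# Apply an f_str() format string row-wise across the vectors passed in ...
apply_formats <- function(format_string, ..., empty = c(.overall = "")) {
   format <- f_str(format_string, ..., empty=empty)
   pmap_chr(list(...), function(...) apply_fmts(...), fmt=format)
}

# Format one set of values; relies on the unexported Tplyr:::num_fmt()
apply_fmts <- function(..., fmt) {
   nums <- list(...)
   repl <- vector('list', length(fmt$settings))
   # Format each number according to its position in the format string
   for (i in seq_along(fmt$settings)) {
      repl[[i]] <- Tplyr:::num_fmt(nums[[i]], i, fmt=fmt)
   }
   # Substitute the formatted numbers into the sprintf() template
   args <- append(list(fmt$repl_str), repl)
   do.call('sprintf', args)
}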

Profiling the PharmaSUG examples

@mstackhouse, we talked about the performance of Tplyr at PharmaSUG 2022, so I thought you might like to see profiling results from the examples in the hands-on training. dplyr functions, tidyr functions, and rlang::expr_deparse() look like the most noticeable findings. In addition, running the demographics example with debug(dplyr::filter), it looks like filter() is called many times on the same data. Maybe the overhead is adding up?

Demographics table

With the original data:

library(Tplyr)
library(pharmaRTF)
library(dplyr)
library(tidyr)
adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))
adsl <- adsl %>%
  mutate(SEX = recode(SEX, 'M' = "Male", 'F' = "Female")) %>% 
  mutate(RACE = recode(RACE, "WHITE" = "White", "BLACK OR AFRICAN AMERICAN"="Black or African American", 
                       "AMERICAN INDIAN OR ALASKA NATIVE"="American Indian or Alaska Native")) %>% 
  mutate(ETHNIC = recode(ETHNIC, "HISPANIC OR LATINO" = "Hispanic or Latino", "NOT HISPANIC OR LATINO" = "Not Hispanic or Latino"))
adsl$RACE <- factor(adsl$RACE, levels=c("American Indian or Alaska Native", "Asian", "Black or African American", 
                                        "Native Hawaiian or Other Pacific Islander",
                                        "White", "Multiple"))
adsl$AGEGR1 <- factor(adsl$AGEGR1, levels = c("<65", "65-80", ">80"))
t <- tplyr_table(adsl, TRT01P) %>%
  add_layer(
    group_count(SEX, by = "Sex n (%)")
  ) %>% 
  add_layer(
    group_desc(AGE, by = "Age (years)")
  ) %>% 
  add_layer(
    group_count(AGEGR1, by = "Age Categories n (%)")
  ) %>% 
  add_layer(
    group_count(RACE, by = "Race n (%)")
  ) %>%
  add_layer(
    group_count(ETHNIC, by = "Ethnicity n(%)")
  ) %>%
  add_layer(
    group_desc(WEIGHTBL, by = "Baseline Weight (kg)")
  )

proffer::pprof(build(t))

[Image: dm-original]

With 500 times more rows:

# ...
big <- bind_rows(replicate(500, adsl, simplify = FALSE))
t <- tplyr_table(big, TRT01P) %>%
# ...

proffer::pprof(build(t))

[Image: dm]

Adverse events table

library(Tplyr)
library(pharmaRTF)
library(dplyr)
library(tidyr)
adsl <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adsl.xpt"))
adae <- haven::read_xpt(url("https://github.com/phuse-org/TestDataFactory/raw/main/Updated/TDF_ADaM/adae.xpt"))
t <- tplyr_table(adae, TRTA, where = SAFFL == "Y") %>% 
  set_pop_data(adsl) %>% 
  set_pop_treat_var(TRT01A) %>% 
  add_total_group() %>% 
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      set_nest_count(TRUE) %>%
      set_format_strings(f_str("xx (xx.x%) [x]", distinct_n, distinct_pct, n)) %>%
      set_order_count_method("bycount", break_ties = 'desc') %>%
      set_ordering_cols("Xanomeline High Dose") %>%
      set_result_order_var(distinct_n)
  )

proffer::pprof(build(t))

[Image: ae]

Session info

R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.2.0     dplyr_1.0.9     pharmaRTF_0.1.4 Tplyr_0.4.4    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     pillar_1.7.0     compiler_4.2.0   forcats_0.5.1   
 [5] tools_4.2.0      huxtable_5.4.0   profile_1.0.2    lifecycle_1.0.1 
 [9] tibble_3.1.7     pkgconfig_2.0.3  rlang_1.0.2      rstudioapi_0.13 
[13] DBI_1.1.2        cli_3.3.0        haven_2.5.0      xfun_0.31       
[17] withr_2.5.0      stringr_1.4.0    knitr_1.39       generics_0.1.2  
[21] vctrs_0.4.1      hms_1.1.1        tidyselect_1.1.2 glue_1.6.2      
[25] RProtoBuf_0.4.19 R6_2.5.1         processx_3.5.3   fansi_1.0.3     
[29] pingr_2.0.1      purrr_0.3.4      readr_2.1.2      tzdb_0.3.0      
[33] proffer_0.1.5    magrittr_2.0.3   ps_1.7.0         ellipsis_0.3.2  
[37] assertthat_0.2.1 utf8_1.2.2       stringi_1.7.6    crayon_1.5.1 
