Tables with frequency for groups of records about expss HOT 10 CLOSED

gdemin commented on May 26, 2024

Tables with frequency for groups of records

from expss.

Comments (10)

gdemin commented on May 26, 2024

Hi @robertogilsaura,
The code below gives me the same result as in your example:

library(expss)
data <- read_spss("prueba.sav")
data %>%
  tab_cols(total()) %>%
  tab_cells(VAR2) %>%
  tab_stat_cases() %>%
  tab_pivot()

  library(expss)
  data <- read_spss("prueba.sav")
  # first variable is gender, second is group
  count_groups = function(var_group){
    # we take unique records to avoid count gender multiple times in the same group
    var_group = unique(var_group)
    # set the same label for group as for first variable to position total in the table
    var_lab(var_group[[2]]) = var_lab(var_group[[1]])
    # we calculate total separately  because we need to count distinct groups
    rbind(
      # count cases
      cro(list(var_group[[1]]), 
          total_row_position = "none"), 
      # calculate total
      cro(
        total(
          unique(var_group[[2]]), label = "#Total"
        ), 
        total_row_position = "none"
      )  
    )
  }
  
  data %>%
    tab_cols(total()) %>%
    tab_cells(data.frame(VAR2, VAR3)) %>%
    tab_stat_fun_df(count_groups) %>%
    tab_pivot()
  # |        |        | #Total |
  # | ------ | ------ | ------ |
  # | Gender |   male |      3 |
  # |        | female |      2 |
  # |        | #Total |      3 |
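The counting rule used above (each gender counted once per group, and the total counted as distinct groups) can be checked in base R. This is only a sketch: `prueba.sav` is not available here, so the `gender` and `group` vectors are hypothetical stand-ins chosen to match the table above.

```r
# hypothetical stand-in for prueba.sav: one row per record
gender <- c("male", "male", "female", "male", "female", "male")
group  <- c(1, 1, 1, 2, 3, 3)

# keep one row per distinct (gender, group) pair
records <- unique(data.frame(gender, group))

# number of distinct groups in which each gender appears
per_gender <- table(records$gender)

# number of distinct groups overall
n_groups <- length(unique(group))

per_gender  # female: 2, male: 3
n_groups    # 3
```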

robertogilsaura commented on May 26, 2024

Hi, @gdemin.

The code runs properly. I have tested it with my real data frame (10423 records in 2559 groups) and the output is the same as with my old software. I also tested it with other variables in the columns, and the output is OK too.

Thanks for the excellent package, but above all for the excellent attention and personal support.

gdemin commented on May 26, 2024

Hi, @robertogilsaura,
Just out of curiosity: what is the name of your old software? Your task looks very specific to me. In the past I made all my tables with SPSS, and I don't know how to calculate such tables easily with it without additional aggregation.

robertogilsaura commented on May 26, 2024

😉 My old software (ancient software...) is Barbro (MS-DOS) / BarbWin (Windows). Since 1984 I have worked at TESI (the company that develops and owns BarbWin) and at the University of Valencia. At TESI, our customers are market research institutes and agencies; at UV, I teach and learn data processing and market research with young people. BarbWin was always an annoying little enemy for SPSS in Latin America and Spain, because our clients there (TNS, Millward Brown, Nielsen, GfK, etc.) always used BarbWin to tabulate their surveys. But time passes, and for many reasons BarbWin has become obsolete and we have moved on to other projects. I have only been using R for a year, and it has been only 6 months since I introduced my students to this exciting world of R.

You must feel very proud of this package (expss). It is truly the best crosstables package in R by a wide margin, and it is allowing me to transfer projects to work online with our new SegmentaNet platform for visualization and data processing.

Very thankful,
Roberto Gil Saura

robertogilsaura commented on May 26, 2024

Hello again @gdemin

I need a variation on the code you gave me before, but I have not been able to get it working.

I need a table where, for each group, the maximum value of one variable is taken, and then the table shows the sum of those maximums broken down by another variable.

In this case, the data are students who attend courses in different centers. There are two centers, and each center has two classrooms. To know the number of students who have been trained, I need to take the maximum for each classroom, because one session may be attended by 100% of the students and another by 80%, in the same classroom of the same center.

Here is my code to reproduce the problem. The output is not what I want.

library(expss)
centro <- c(1,1,1,1,2,2,2)
aula <-c ("1a","1a","1b","1b","2a","2b","2a")
alumnos <- c(50,50,25,25,100,10,80)
data <- data.frame(centro,aula,alumnos)

bsum.imax = function(dfs) #between groups sum, intragroups max
    {
    # dfs - data.frame
    # first column - value
    # all other columns - object idgroup, it will be centroaula in our case
    # we should reference data.frame column by number because at runtime it will be unknown labels of the variables
    setNames(sum(max(dfs[[2]]), na.rm = TRUE), colnames(dfs)[2]) # here we set name on the result
    }
data %>%
    tab_cols(total(), centro) %>%
    tab_cells(data.frame(aula,alumnos)) %>% # note the data.frame with two variables
    tab_stat_fun_df(bsum.imax) %>%
    tab_pivot() %>% 
    drop_rc() %>% 
    t(.)

The output is:

 |        |    | alumnos |
 | ------ | -- | ------- |
 | #Total |    |     100 |
 | centro |  1 |      50 |
 |        |  2 |     100 |

But my desired output is (max of 1a (50) + max of 1b (25) = 75; max of 2a (100) + max of 2b (10) = 110):

 |        |    | alumnos |
 | ------ | -- | ------- |
 | #Total |    |     185 |
 | centro |  1 |      75 |
 |        |  2 |     110 |

I have read the "tables" function documentation and have not been able to find a way to perform this calculation with tab_stat_fun_df() or with another function.

Thanks in advance.
Robert

robertogilsaura commented on May 26, 2024

I found a solution, but I don't know whether it is the most appropriate one. I think it would cause problems with multiple variables in the table.

library(dplyr) # provides group_by() and summarise()
library(expss)
centro <- c(1,1,1,1,2,2,2)
aula <-c ("1a","1a","1b","1b","2a","2b","2a")
alumnos <- c(50,50,25,25,100,10,80)
data <- data.frame(centro,aula,alumnos)

dfs <- data %>% group_by(centro,aula)
dfs <- dfs %>% summarise(alumnos=max(alumnos))

dfs %>%
    tab_cols(total(), centro) %>%
    tab_cells(alumnos) %>%
    tab_stat_sum() %>%
    tab_pivot() %>% 
    drop_rc() %>% 
    t(.)

Output:

 |        |    | alumnos |
 |        |    |     Sum |
 | ------ | -- | ------- |
 | #Total |    |     185 |
 | centro |  1 |      75 |
 |        |  2 |     110 |

Any suggestions for improvement?

Thanks in advance.
Robert.

gdemin commented on May 26, 2024

Hi, @robertogilsaura
Your original version takes a single maximum across all groups. We need to calculate the maximum of alumnos within each aula inside bsum.imax:

library(expss)
centro <- c(1,1,1,1,2,2,2)
aula <-c ("1a","1a","1b","1b","2a","2b","2a")
alumnos <- c(50,50,25,25,100,10,80)
data <- data.frame(centro,aula,alumnos)

bsum.imax = function(dfs) #between groups sum, intragroups max
{
    # dfs - data.frame
    # first column - group id (aula in our case)
    # second column - value (alumnos in our case)
    # we reference data.frame columns by number because the variable labels are not known at runtime
    maxes = tapply(dfs[[2]], dfs[[1]], FUN = max, na.rm = TRUE) # get the maximum from each group
    setNames(sum(maxes, na.rm = TRUE), colnames(dfs)[2]) # here we set name on the result
}
data %>%
    tab_cols(total(), centro) %>%
    tab_cells(data.frame(aula,alumnos)) %>% # note the data.frame with two variables
    tab_stat_fun_df(bsum.imax) %>%
    tab_pivot() %>% 
    drop_rc() %>% 
    t(.)
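As a quick cross-check outside expss, the desired numbers (75, 110 and 185) can be reproduced in base R on the toy data from this thread. This is only a sketch of the two-step calculation; the names `max_by_aula` and `sum_by_centro` are made up for illustration.

```r
centro  <- c(1, 1, 1, 1, 2, 2, 2)
aula    <- c("1a", "1a", "1b", "1b", "2a", "2b", "2a")
alumnos <- c(50, 50, 25, 25, 100, 10, 80)
data <- data.frame(centro, aula, alumnos)

# step 1: maximum of alumnos within each (centro, aula) pair
max_by_aula <- aggregate(alumnos ~ centro + aula, data = data, FUN = max)

# step 2: sum of those maximums within each centro
sum_by_centro <- tapply(max_by_aula$alumnos, max_by_aula$centro, sum)

sum_by_centro        # centro 1: 75, centro 2: 110
sum(sum_by_centro)   # overall total: 185
```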

robertogilsaura commented on May 26, 2024

Hi @gdemin,

Thank you very much for your input. I think your solution is more appropriate, as it avoids loading dplyr. However, I have seen that the response time on large datasets (about 10,000 records) is considerable.

Anyway, it is true that these types of tables are not very common, so in normal datasets, I prefer not to load supplementary packages.

Thank you very much again.

gdemin commented on May 26, 2024

If performance is an issue we can use data.table magic:

bsum.imax = function(dfs) #between groups sum, intragroups max
{
    # dfs - data.frame
    # first column - group id (aula in our case)
    # second column - value (alumnos in our case)
    # we reference data.frame columns by number because the variable labels are not known at runtime
    varname = colnames(dfs)[1]
    label = colnames(dfs)[2]
    maxes = dfs[, lapply(.SD, max, na.rm = TRUE), by = eval(varname)][[2]] # get the vector with maximums from each group
    setNames(sum(maxes, na.rm = TRUE), label) # here we set name on the result
}
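For reference, the same "maximum within each group, then sum" calculation can be written directly in data.table, outside expss. This is a minimal sketch on the toy data from this thread, assuming the data.table package is installed; `dt`, `maxes` and `totals` are illustrative names.

```r
library(data.table)  # assumption: the data.table package is installed

dt <- data.table(
  centro  = c(1, 1, 1, 1, 2, 2, 2),
  aula    = c("1a", "1a", "1b", "1b", "2a", "2b", "2a"),
  alumnos = c(50, 50, 25, 25, 100, 10, 80)
)

# maximum per (centro, aula), then sum of those maximums per centro
maxes  <- dt[, .(alumnos = max(alumnos)), by = .(centro, aula)]
totals <- maxes[, .(alumnos = sum(alumnos)), by = centro]

totals                 # centro 1: 75, centro 2: 110
sum(totals$alumnos)    # 185
```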

robertogilsaura commented on May 26, 2024

Wow, magic, magic!!! The performance has improved a lot.

The run time went down from 135 seconds with tapply() to 5 seconds with lapply().

Thank you very much again. I still have a lot to understand and learn.
