Issue with lookup_estimates and mse (verdata, closed)

mduranf commented on June 14, 2024

Issue with lookup_estimates and mse

from verdata.

Comments (6)

thegargiulian commented on June 14, 2024

I'll take a look at this this week and will report back!

thegargiulian commented on June 14, 2024

Hi @mduranf! I finally had some time to dive into this. In short, I don't think that there are any issues with lookup_estimates(...) or mse(...), but rather that the example under study just included a lot of strata that we didn't estimate, hence mse(...) takes a lot of time to run.

Here's a reprex that makes use of the new function estimates_exist(...) to illustrate this point:

    # make sure that the newest version of verdata is installed (for the new function
    # `estimates_exist`; last updated on 18 July 2023)
    pacman::p_load(verdata, dplyr, stringr, purrr)
    
    
    stratify <- function(replicate_data, schema) {
        
        schema_list <- unlist(str_split(schema, pattern = ","))
        
        grouped_data <- replicate_data %>%
            group_by(!!!syms(schema_list))
        
        stratification_vars <- grouped_data %>%
            group_keys() %>%
            group_by_all() %>%
            group_split()
        
        split_data <- grouped_data %>%
            group_split(.keep = FALSE)
        
        return(list(strata_data = split_data,
                    stratification_vars = stratification_vars))
        
    }
    
    # setup test data
    desaparicion <- read_replicates("desaparicion-parquet", violation = "desaparicion", 1, 10)
    
    schema <- ("replica,yy_hecho,is_forced_dis") # stratification where estimates should exist
    
    estratificacion <- stratify(desaparicion, schema)
    
    # use the new function `estimates_exist` to identify which estimates already exist
    # and which don't.
    
    # start with a test on a single stratum
    estimates_exist(estratificacion$strata_data[[2]] %>% select(starts_with("in_")),
                    estimates_dir = "estimates")
#> $estimates_exist
#> [1] TRUE
#> 
#> $estimates_path
#> estimates/eb/ebffd3c47ef95808977418b561222233ef432383.json
    
    
    # let's write an auxiliary function we can apply to check all of the strata
    
    check_estimates <- function(stratum_data, stratification_vars, estimates_dir) {
        
        stratification_vars <- paste0(stratification_vars, collapse = "-")
        
        stratum_data <- stratum_data %>%
            select(starts_with("in_"))
        
        lookup <- estimates_exist(stratum_data, estimates_dir)
        
        results <- tibble(estimates_exist = lookup$estimates_exist,
                          stratum_name = stratification_vars)
        
        return(results)
        
    }
    
    
    # apply to example from before
    check_estimates(estratificacion$strata_data[[2]],
                    estratificacion$stratification_vars[[2]],
                    estimates_dir = "estimates")
#> # A tibble: 1 × 2
#>   estimates_exist stratum_name
#>   <lgl>           <chr>       
#> 1 TRUE            R1-1985-TRUE
    
    # now apply to all strata; this takes a few seconds to run for all the strata
    lookup_results <- map2_dfr(.x = estratificacion$strata_data,
                               .y = estratificacion$stratification_vars,
                               .f = ~check_estimates(stratum_data = .x,
                                                     stratification_vars = .y,
                                                     estimates_dir = "estimates"))
    
    head(lookup_results)
#> # A tibble: 6 × 2
#>   estimates_exist stratum_name 
#>   <lgl>           <chr>        
#> 1 FALSE           R1-1985-FALSE
#> 2 TRUE            R1-1985-TRUE 
#> 3 FALSE           R1-1986-FALSE
#> 4 FALSE           R1-1986-TRUE 
#> 5 FALSE           R1-1987-FALSE
#> 6 TRUE            R1-1987-TRUE
    
    table(lookup_results$estimates_exist)
#> 
#> FALSE  TRUE 
#>   495   145
    
    # estimates don't exist for many strata, particularly those that refer to
    # disappearances that are not forced disappearances. running `mse` on this data
    # is going to take a long time because nearly 500 strata need to be estimated
    
    # for strata where estimates do exist, however, `lookup_estimates()` works as
    # intended and the results are basically instantaneous
    mse(stratum_data = estratificacion$strata_data[[2]],
        stratum_name = paste0(estratificacion$stratification_vars[[2]], collapse = "-"),
        estimates_dir = "estimates")
#> # A tibble: 1,000 × 5
#>    validated     N valid_sources                              n_obs stratum_name
#>    <lgl>     <dbl> <chr>                                      <int> <chr>       
#>  1 TRUE       2196 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  2 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  3 TRUE       2501 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  4 TRUE       2221 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  5 TRUE       2244 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  6 TRUE       2254 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  7 TRUE       2290 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  8 TRUE       2232 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  9 TRUE       2356 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> 10 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> # ℹ 990 more rows
    
    # this example isn't run, but we didn't do estimates for non-enforced
    # disappearances so we would expect this one to take more time (if it can
    # be estimated at all)
    # mse(stratum_data = estratificacion$strata_data[[1]],
    #     stratum_name = paste0(estratificacion$stratification_vars[[1]], collapse = "-"),
    #     estimates_dir = "estimates")

Created on 2023-07-18 by the reprex package (v2.0.1)

mduranf commented on June 14, 2024

Thanks @thegargiulian, this makes sense!

So one thing we could do, particularly for examples illustrating how to replicate results, is filter by is_forced_dis, the same way we filter by is_conflict when we read the data, and then stratify by yy_hecho and is_forced_dis normally, right?

I noticed it is still going to take time because there are 18 strata whose estimates don't exist, even though they refer to forced disappearances. I suppose those were lost when there were issues with the server?
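One way to get a count like that is to tabulate the `estimates_exist` results by the `is_forced_dis` part of the stratum name. This is only a sketch: `toy_lookup` is a made-up stand-in that copies the six rows printed by `head(lookup_results)` in the reprex, and it assumes the last piece of `stratum_name` (after the final `-`) is always the `is_forced_dis` value.

```r
library(dplyr)
library(stringr)

# toy stand-in copying the six rows shown by head(lookup_results) in the reprex;
# with the real data you would use `lookup_results` directly
toy_lookup <- tibble(
    estimates_exist = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
    stratum_name    = c("R1-1985-FALSE", "R1-1985-TRUE", "R1-1986-FALSE",
                        "R1-1986-TRUE", "R1-1987-FALSE", "R1-1987-TRUE")
)

# assumption: the text after the last "-" in the stratum name is is_forced_dis
missing_by_type <- toy_lookup %>%
    mutate(is_forced_dis = str_extract(stratum_name, "[^-]+$") == "TRUE") %>%
    count(is_forced_dis, estimates_exist)

missing_by_type
```

The row with `is_forced_dis == TRUE` and `estimates_exist == FALSE` gives the number of forced-disappearance strata that would still need to be estimated.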

thegargiulian commented on June 14, 2024

Filtering is definitely the right thing to do and mapping that action onto a particular research question (e.g., what were the temporal patterns in enforced disappearances during the conflict?) makes it more obvious why we would do something like that instead of estimating everything.

As for the strata where estimates are not available, it's possible that those files were lost when there were server issues. Some of the strata also may not be estimable.
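For illustration, here is a minimal sketch of what filtering before stratifying could look like. It uses a made-up `toy_replicates` tibble rather than the real replicates, and assumes `is_forced_dis` is a logical column (as the stratum names like `R1-1985-TRUE` in the reprex suggest). With the real data, the same `filter()` call would go between `read_replicates(...)` and `stratify(...)`.

```r
library(dplyr)

# toy stand-in for the replicate data (hypothetical values); the real data
# would come from read_replicates(...) as in the reprex
toy_replicates <- tibble(
    replica       = c("R1", "R1", "R1", "R1", "R2", "R2"),
    yy_hecho      = c(1985, 1985, 1986, 1986, 1985, 1986),
    is_forced_dis = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
    in_1          = c(1, 0, 1, 1, 0, 1),
    in_2          = c(0, 1, 1, 0, 1, 1)
)

# keep only forced disappearances before stratifying
forced_only <- toy_replicates %>%
    filter(is_forced_dis)

# number of strata that would need estimates, without and with the filter
n_strata_all <- toy_replicates %>%
    distinct(replica, yy_hecho, is_forced_dis) %>%
    nrow()

n_strata_forced <- forced_only %>%
    distinct(replica, yy_hecho, is_forced_dis) %>%
    nrow()
```

The stratification schema stays the same (`replica,yy_hecho,is_forced_dis`); the filter just drops the strata that `mse(...)` would otherwise have to estimate from scratch.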

thegargiulian commented on June 14, 2024

Hi @mduranf! If you're happy with the discussion here can you close this issue, or alternatively let me know if you think there are any other issues to fix? Thanks!

mduranf commented on June 14, 2024

We can close it! Thank you :)
