Comments (6)

thegargiulian commented on June 14, 2024

I'll take a look at this this week and will report back!

from verdata.

thegargiulian commented on June 14, 2024

Hi @mduranf! I finally had some time to dive into this. In short, I don't think there are any issues with lookup_estimates(...) or mse(...). Rather, the example under study includes a lot of strata that we never estimated, so mse(...) has to estimate them itself and takes a long time to run.

Here's a reprex that makes use of the new function estimates_exist(...) to illustrate this point:

    # make sure that the newest version of verdata is installed (for the new
    # function `estimates_exist`; last updated on 18 July 2023)
    pacman::p_load(verdata, dplyr, stringr, purrr)
    
    
    stratify <- function(replicate_data, schema) {
        
        schema_list <- unlist(str_split(schema, pattern = ","))
        
        grouped_data <- replicate_data %>%
            group_by(!!!syms(schema_list))
        
        stratification_vars <- grouped_data %>%
            group_keys() %>%
            group_by_all() %>%
            group_split()
        
        split_data <- grouped_data %>%
            group_split(.keep = FALSE)
        
        return(list(strata_data = split_data,
                    stratification_vars = stratification_vars))
        
    }
    
    # setup test data
    desaparicion <- read_replicates("desaparicion-parquet", violation = "desaparicion", 1, 10)
    
    schema <- ("replica,yy_hecho,is_forced_dis") # stratification where estimates should exist
    
    estratificacion <- stratify(desaparicion, schema)
    
    # use the new function `estimates_exist` to identify which estimates already exist
    # and which don't.
    
    # start with a test on a single stratum
    estimates_exist(estratificacion$strata_data[[2]] %>% select(starts_with("in_")),
                    estimates_dir = "estimates")
#> $estimates_exist
#> [1] TRUE
#> 
#> $estimates_path
#> estimates/eb/ebffd3c47ef95808977418b561222233ef432383.json
    
    
    # let's write an auxiliary function we can apply to check all of the strata
    
    check_estimates <- function(stratum_data, stratification_vars, estimates_dir) {
        
        stratification_vars <- paste0(stratification_vars, collapse = "-")
        
        stratum_data <- stratum_data %>%
            select(starts_with("in_"))
        
        lookup <- estimates_exist(stratum_data, estimates_dir)
        
        results <- tibble(estimates_exist = lookup$estimates_exist,
                          stratum_name = stratification_vars)
        
        return(results)
        
    }
    
    
    # apply to example from before
    check_estimates(estratificacion$strata_data[[2]],
                    estratificacion$stratification_vars[[2]],
                    estimates_dir = "estimates")
#> # A tibble: 1 × 2
#>   estimates_exist stratum_name
#>   <lgl>           <chr>       
#> 1 TRUE            R1-1985-TRUE
    
    # now apply to all strata; this takes a few seconds to run for all the strata
    lookup_results <- map2_dfr(.x = estratificacion$strata_data,
                               .y = estratificacion$stratification_vars,
                               .f = ~check_estimates(stratum_data = .x,
                                                     stratification_vars = .y,
                                                     estimates_dir = "estimates"))
    
    head(lookup_results)
#> # A tibble: 6 × 2
#>   estimates_exist stratum_name 
#>   <lgl>           <chr>        
#> 1 FALSE           R1-1985-FALSE
#> 2 TRUE            R1-1985-TRUE 
#> 3 FALSE           R1-1986-FALSE
#> 4 FALSE           R1-1986-TRUE 
#> 5 FALSE           R1-1987-FALSE
#> 6 TRUE            R1-1987-TRUE
    
    table(lookup_results$estimates_exist)
#> 
#> FALSE  TRUE 
#>   495   145
    
    # estimates for many strata don't exist, particularly for strata that refer to
    # disappearances that are not forced disappearances. running `mse` on this data
    # is going to take a long time because there are nearly 500 strata that need
    # to be estimated
    
    # for strata whose estimates do exist, however, `lookup_estimates()` works as
    # intended and the results are basically instantaneous
    mse(stratum_data = estratificacion$strata_data[[2]],
        stratum_name = paste0(estratificacion$stratification_vars[[2]], collapse = "-"),
        estimates_dir = "estimates")
#> # A tibble: 1,000 × 5
#>    validated     N valid_sources                              n_obs stratum_name
#>    <lgl>     <dbl> <chr>                                      <int> <chr>       
#>  1 TRUE       2196 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  2 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  3 TRUE       2501 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  4 TRUE       2221 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  5 TRUE       2244 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  6 TRUE       2254 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  7 TRUE       2290 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  8 TRUE       2232 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  9 TRUE       2356 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> 10 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> # ℹ 990 more rows
    
    # this example isn't run, but we didn't do estimates for disappearances that
    # are not forced disappearances, so we would expect this one to take more
    # time (if it can be estimated at all)
    # mse(stratum_data = estratificacion$strata_data[[1]],
    #     stratum_name = paste0(estratificacion$stratification_vars[[1]], collapse = "-"),
    #     estimates_dir = "estimates")

Created on 2023-07-18 by the reprex package (v2.0.1)
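
To see where the missing estimates concentrate, the lookup results can be summarised by the last component of the stratum name. The following is a minimal sketch, not part of the original reprex: `lookup_head` is a hand-typed copy of the six rows printed by `head(lookup_results)` above, and `missing_by_type` is a hypothetical name. It assumes the stratum name always ends in the `is_forced_dis` value, as it does for the schema used here.

```r
library(dplyr)

# hand-typed copy of the six rows shown by head(lookup_results) above;
# with the real data you would start from `lookup_results` instead
lookup_head <- tibble(
    estimates_exist = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
    stratum_name = c("R1-1985-FALSE", "R1-1985-TRUE", "R1-1986-FALSE",
                     "R1-1986-TRUE", "R1-1987-FALSE", "R1-1987-TRUE")
)

# assumption: the text after the last "-" is the is_forced_dis value
missing_by_type <- lookup_head %>%
    mutate(is_forced_dis = sub(".*-", "", stratum_name)) %>%
    group_by(is_forced_dis) %>%
    summarise(n_strata = n(),
              n_missing = sum(!estimates_exist))

print(missing_by_type)
```

On these six rows, all three `FALSE` strata lack estimates, while only one of the three `TRUE` strata does. Running the same summary on the full `lookup_results` would show how the 495 missing strata split between the two groups.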

mduranf commented on June 14, 2024

Thanks @thegargiulian, this makes sense!

So one thing we could do, particularly for examples illustrating how to replicate results, is filter by is_forced_dis, the same way we filter by is_conflict when we read the data, and then stratify by yy_hecho and is_forced_dis normally, right?
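
The filter-then-stratify idea can be sketched roughly as follows. This is only an illustration on made-up data: `toy_replicates`, `forced_only`, and `strata_keys` are hypothetical names, the values are invented, and the sketch assumes `is_forced_dis` is stored as a logical column. With the real data, the input would be the output of `read_replicates(...)`.

```r
library(dplyr)

# made-up stand-in for the replicate data (values are invented)
toy_replicates <- tibble(
    replica = rep("R1", 6),
    yy_hecho = c(1985, 1985, 1986, 1986, 1987, 1987),
    is_forced_dis = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
    in_1 = c(1, 0, 1, 1, 0, 1)
)

# keep only forced disappearances before stratifying
# (assumes is_forced_dis is logical; adjust the condition otherwise)
forced_only <- toy_replicates %>%
    filter(is_forced_dis)

# stratify as before; only strata with is_forced_dis == TRUE remain
strata_keys <- forced_only %>%
    group_by(replica, yy_hecho, is_forced_dis) %>%
    group_keys()

print(strata_keys)
```

Filtering first means the strata for disappearances that are not forced are never created, so `mse(...)` is never asked to estimate them.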

I noticed it is still going to take time because there are 18 strata that don't exist, even though they refer to forced disappearances, and I suppose those were lost when there were issues with the server?

thegargiulian commented on June 14, 2024

Filtering is definitely the right thing to do and mapping that action onto a particular research question (e.g., what were the temporal patterns in enforced disappearances during the conflict?) makes it more obvious why we would do something like that instead of estimating everything.

As for the strata where estimates are not available, it's possible that those files were lost when there were server issues. Some of the strata also may not be estimable.

thegargiulian commented on June 14, 2024

Hi @mduranf! If you're happy with the discussion here, can you close this issue? Alternatively, let me know if you think there are any other issues to fix. Thanks!

mduranf commented on June 14, 2024

We can close it! Thank you :)
