Comments (6)
I'll take a look at this this week and will report back!
from verdata.
Hi @mduranf! I finally had some time to dive into this. In short, I don't think that there are any issues with `lookup_estimates(...)` or `mse(...)`, but rather that the example under study just included a lot of strata that we didn't estimate, hence `mse(...)` takes a lot of time to run.
Here's a reprex that makes use of the new function `estimates_exist(...)` to illustrate this point:
# make sure that newest version of verdata is installed (for new function
# `estimates_exist`; last updated on 18 July 2023)
pacman::p_load(verdata, dplyr, stringr, purrr)
stratify <- function(replicate_data, schema) {
    schema_list <- unlist(str_split(schema, pattern = ","))
    grouped_data <- replicate_data %>%
        group_by(!!!syms(schema_list))
    stratification_vars <- grouped_data %>%
        group_keys() %>%
        group_by_all() %>%
        group_split()
    split_data <- grouped_data %>%
        group_split(.keep = FALSE)
    return(list(strata_data = split_data,
                stratification_vars = stratification_vars))
}
# setup test data
desaparicion <- read_replicates("desaparicion-parquet", violation = "desaparicion", 1, 10)
schema <- ("replica,yy_hecho,is_forced_dis") # stratification where estimates should exist
estratificacion <- stratify(desaparicion, schema)
# use the new function `estimates_exist` to identify which estimates already exist
# and which don't.
# start with a test on a single stratum
estimates_exist(estratificacion$strata_data[[2]] %>% select(starts_with("in_")),
                estimates_dir = "estimates")
#> $estimates_exist
#> [1] TRUE
#>
#> $estimates_path
#> estimates/eb/ebffd3c47ef95808977418b561222233ef432383.json
# let's write an auxiliary function we can apply to check all of the strata
check_estimates <- function(stratum_data, stratification_vars, estimates_dir) {
    stratification_vars <- paste0(stratification_vars, collapse = "-")
    stratum_data <- stratum_data %>%
        select(starts_with("in_"))
    lookup <- estimates_exist(stratum_data, estimates_dir)
    results <- tibble(estimates_exist = lookup$estimates_exist,
                      stratum_name = stratification_vars)
    return(results)
}
# apply to example from before
check_estimates(estratificacion$strata_data[[2]],
                estratificacion$stratification_vars[[2]],
                estimates_dir = "estimates")
#> # A tibble: 1 × 2
#>   estimates_exist stratum_name
#>   <lgl>           <chr>
#> 1 TRUE            R1-1985-TRUE
# now apply to all strata; this takes a few seconds to run for all the strata
lookup_results <- map2_dfr(.x = estratificacion$strata_data,
                           .y = estratificacion$stratification_vars,
                           .f = ~check_estimates(stratum_data = .x,
                                                 stratification_vars = .y,
                                                 estimates_dir = "estimates"))
head(lookup_results)
#> # A tibble: 6 × 2
#>   estimates_exist stratum_name
#>   <lgl>           <chr>
#> 1 FALSE           R1-1985-FALSE
#> 2 TRUE            R1-1985-TRUE
#> 3 FALSE           R1-1986-FALSE
#> 4 FALSE           R1-1986-TRUE
#> 5 FALSE           R1-1987-FALSE
#> 6 TRUE            R1-1987-TRUE
table(lookup_results$estimates_exist)
#>
#> FALSE  TRUE
#>   495   145
# estimates don't exist for many strata, particularly those that refer to
# disappearances that are not forced disappearances. running `mse` on this data
# is going to take a long time because there are nearly 500 strata that need to
# be estimated
# for strata where estimates do exist, however, `lookup_estimates()` works as
# intended and the results are basically instantaneous
mse(stratum_data = estratificacion$strata_data[[2]],
    stratum_name = paste0(estratificacion$stratification_vars[[2]], collapse = "-"),
    estimates_dir = "estimates")
#> # A tibble: 1,000 × 5
#>    validated     N valid_sources                              n_obs stratum_name
#>    <lgl>     <dbl> <chr>                                      <int> <chr>
#>  1 TRUE       2196 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  2 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  3 TRUE       2501 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  4 TRUE       2221 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  5 TRUE       2244 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  6 TRUE       2254 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  7 TRUE       2290 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  8 TRUE       2232 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#>  9 TRUE       2356 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> 10 TRUE       2216 in_1,in_4,in_5,in_6,in_7,in_8,in_11,in_12…  1575 R1-1985-TRUE
#> # ℹ 990 more rows
# this example isn't run, but we didn't produce estimates for disappearances
# that are not enforced disappearances, so we would expect this one to take
# more time (if it can be estimated at all)
# mse(stratum_data = estratificacion$strata_data[[1]],
#     stratum_name = paste0(estratificacion$stratification_vars[[1]], collapse = "-"),
#     estimates_dir = "estimates")
Created on 2023-07-18 by the reprex package (v2.0.1)
Thanks @thegargiulian, this makes sense!
So one thing we could do, particularly for examples illustrating how to replicate results, is filter by `is_forced_dis`, the same way we filter by `is_conflict` when we read the data, and then stratify by `yy_hecho` and `is_forced_dis` normally, right?
I noticed it is still going to take time because there are 18 strata that don't exist, even though they refer to forced disappearances, and I suppose those were lost when there were issues with the server?
Filtering is definitely the right thing to do and mapping that action onto a particular research question (e.g., what were the temporal patterns in enforced disappearances during the conflict?) makes it more obvious why we would do something like that instead of estimating everything.
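As a rough sketch of that filter-then-stratify workflow: the toy tibble below is invented for illustration and only stands in for the output of `read_replicates()`, and it assumes `is_forced_dis` is stored as a logical column.

```r
library(dplyr)

# toy stand-in for replicate data; all values are made up for illustration
toy_replicates <- tibble(
  replica = c("R1", "R1", "R1", "R1", "R2", "R2"),
  yy_hecho = c(1985, 1985, 1986, 1986, 1985, 1985),
  is_forced_dis = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  in_1 = c(1, 0, 1, 1, 0, 1)
)

# keep only the records relevant to the research question before stratifying
# (assumes a logical column; adjust the condition if it is coded differently)
forced_only <- toy_replicates %>%
  filter(is_forced_dis)

# strata left after filtering: one per replicate-year combination
strata_keys <- forced_only %>%
  group_by(replica, yy_hecho, is_forced_dis) %>%
  group_keys()

nrow(strata_keys)
#> [1] 3
```

Without the filter, the same toy data would produce six strata, so filtering first halves the number of strata that need a lookup or a fresh estimate.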
As for the strata where estimates are not available, it's possible that those files were lost when there were server issues. Some of the strata also may not be estimable.
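To see which enforced-disappearance strata are missing estimates, one option is to filter the lookup table built in the reprex above. This is only a sketch: `toy_lookup` is a hypothetical stand-in whose rows copy the `head(lookup_results)` output shown earlier, and the filter assumes that stratum names end in the `is_forced_dis` value.

```r
library(dplyr)
library(stringr)

# stand-in for `lookup_results`, copied from the first six rows printed above
toy_lookup <- tibble(
  estimates_exist = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stratum_name = c("R1-1985-FALSE", "R1-1985-TRUE", "R1-1986-FALSE",
                   "R1-1986-TRUE", "R1-1987-FALSE", "R1-1987-TRUE")
)

# strata for forced disappearances that have no estimates on disk
missing_forced <- toy_lookup %>%
  filter(!estimates_exist, str_ends(stratum_name, "-TRUE"))

missing_forced$stratum_name
#> [1] "R1-1986-TRUE"
```

Running the same filter on the full `lookup_results` would list the strata that `mse()` has to estimate from scratch, or that may turn out not to be estimable.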
Hi @mduranf! If you're happy with the discussion here can you close this issue, or alternatively let me know if you think there are any other issues to fix? Thanks!
We can close it! Thank you :)