I found an incorrectly duplicated id_col in master_data.csv: two separate analyses by the same team, one blue tit and one eucalyptus, share the same id_col, with identical submission_id, analysis_id and split_id values. One of them will need to be recoded in the response_id, submission_id, analysis_id and split_id columns.
See details in reprex below:
library(tidyverse)
library(here)
#> here() starts at /Users/elliotgould/Documents/GitHub/ManyAnalysts
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(ManyEcoEvo)
prepare_df_for_summarising <- function(data){
  data %>%
    mutate(across(.cols = c(num_fixed_variables,
                            num_random_variables,
                            sample_size,
                            num_interactions,
                            Bayesian, #NA's coming from CHECK values
                            mixed_model,
                            num_fixed_effects,
                            num_random_effects),
                  as.numeric),
           lm = ifelse(linear_model == "linear", 1, 0),
           glm = ifelse(linear_model == "generalised", 1, 0))
}
Master <- ManyEcoEvo %>%
  select(data) %>%
  unnest(everything()) %>%
  prepare_df_for_summarising() #NAs ok, caused by CHECK vals, not yet using THP's fixes
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(...)`.
#> Caused by warning:
#> ! NAs introduced by coercion
Note that we are getting an unexpected many-to-many relationship in the join below.
predictions <- read_csv(here::here("ms/predictions_Ids.csv")) %>% #TODO ask HF source
  distinct() %>%
  left_join(Master, by = c("id_col")) %>%
  prepare_df_for_summarising()
#> Rows: 258 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): id_col
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
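As a possible guard for joins like this one (a minimal sketch, not something the pipeline currently does): since dplyr 1.1.0, left_join() takes a relationship argument, so declaring the expected relationship makes a duplicated key fail loudly instead of silently adding rows. The ids and master tibbles below are made-up stand-ins for predictions_Ids.csv and Master, and the team name and values are hypothetical.

```r
library(dplyr)

# Toy stand-ins (hypothetical values) for the id list and the master data
ids <- tibble(id_col = c("TeamA-1-1-1", "TeamA-1-2-1"))
master <- tibble(
  id_col  = c("TeamA-1-1-1", "TeamA-1-2-1", "TeamA-1-2-1"),
  dataset = c("blue tit", "blue tit", "eucalyptus")
)

# Each id should match at most one row in master, so declare that expectation.
# The join errors when a key matches more than one row; catch it to report it.
joined <- tryCatch(
  left_join(ids, master, by = "id_col", relationship = "one-to-one"),
  error = function(e) {
    message("Join failed: ", conditionMessage(e))
    NULL
  }
)

# joined is NULL here because "TeamA-1-2-1" matches two rows in master
is.null(joined)
```

Without the relationship argument the same join would succeed and return three rows, which is how a duplicated id_col can slip through unnoticed.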
There are duplicate entries for one id_col. Let’s identify these analyses:
predictions %>%
  janitor::get_dupes("id_col") %>%
  select(id_col, ends_with("_id"), TeamIdentifier) %>%
  knitr::kable()
| id_col | response_id | submission_id | analysis_id | split_id | TeamIdentifier |
|:---|:---|---:|---:|---:|:---|
| Byrock-1-8-1 | R_3qfD5ZHHdBbTgk3 | 1 | 8 | 1 | Byrock |
| Byrock-1-8-1 | R_3HzSBqQTAmJJ9ye | 1 | 8 | 1 | Byrock |
It seems that there are two separate response_id entries for this team. However, they are both coded with the same id_col.
Let’s see which columns have values that differ between the two duplicated rows:
duplicated_variables <-
  predictions %>%
  select(-review_data) %>%
  janitor::get_dupes("id_col") %>%
  summarise(id_col = unique(id_col),
            across(-all_of("id_col"),
                   ~ first(.x) == last(.x))) %>%
  select(id_col, where(isFALSE))
predictions %>%
  semi_join(duplicated_variables, by = join_by("id_col")) %>%
  select(id_col, colnames(duplicated_variables)) %>%
  knitr::kable()
| id_col | response_id | beta_estimate | adjusted_df | beta_SE | transformation | link_function_reported | dataset | mixed_model | response_variable_name | response_id_S2 | sample_size | linear_model | exclusions_all | Conclusion | lm | glm |
|:---|:---|---:|---:|---:|:---|:---|:---|---:|:---|:---|---:|:---|:---|:---|---:|---:|
| Byrock-1-8-1 | R_3qfD5ZHHdBbTgk3 | -0.065490 | 458.3576 | 0.014100 | identity | identity | blue tit | 1 | day_14_weight | R_3qfD5ZHHdBbTgk3 | 3720 | linear | exclude | neg_c | 1 | 0 |
| Byrock-1-8-1 | R_3HzSBqQTAmJJ9ye | -0.028464 | 345.0000 | 0.025721 | log | log | eucalyptus | 0 | euc_sdlgs0_50cm | R_3HzSBqQTAmJJ9ye | 350 | generalised | retain | neg_q | 0 | 1 |
OK, there is one analysis for eucalyptus and one for blue tit, so the split_id is coded incorrectly, as these are clearly separate analyses.
I can see that this id is also assigned to different response_ids, i.e. from different submissions.
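To look for other cases like this across the whole dataset, something along these lines might work. This is only a sketch on made-up data: the toy tibble and its values are hypothetical, and the assumption is that each id_col should map to exactly one response_id and one dataset.

```r
library(dplyr)

# Toy data (hypothetical values): one id_col is shared by two response_ids
toy <- tibble(
  id_col      = c("TeamA-1-1-1", "TeamA-1-8-1", "TeamA-1-8-1"),
  response_id = c("R_aaa", "R_bbb", "R_ccc"),
  dataset     = c("blue tit", "blue tit", "eucalyptus")
)

# Count distinct response_ids and datasets per id_col;
# more than one of either suggests the id_col was assigned twice
id_conflicts <- toy %>%
  group_by(id_col) %>%
  summarise(
    n_response_ids = n_distinct(response_id),
    n_datasets     = n_distinct(dataset),
    .groups = "drop"
  ) %>%
  filter(n_response_ids > 1 | n_datasets > 1)

id_conflicts
```

Unlike get_dupes(), this ignores rows that are legitimately repeated for one response_id and flags only ids that point at more than one submission or dataset.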
I note that in the file predictions_Ids.csv there are three duplicated entries for this id_col.
We should make sure that there isn’t a third analysis somewhere that is also duplicated in id_col.
It would be helpful to know how Hannah created this dataset.
OK, I also note that for response_id R_3HzSBqQTAmJJ9ye there are three entries in predictions_validations_worksheet.csv belonging to this response_id. So that’s why there are multiple entries in predictions_Ids.csv.
The submission, analysis and split ID columns in that data file are:
1-8-1
2-9-1
3-10-1
The predictions object here is also created from the Master object, or ManyEcoEvo::ManyEcoEvo, which comes from the master_data.csv file.
Let’s look at that to see if that’s potentially the source of the problem:
Master %>%
  filter(TeamIdentifier == "Byrock") %>%
  select(id_col, dataset, all_of(ends_with("_id"))) %>%
  distinct() %>%
  janitor::get_dupes("id_col")
#> # A tibble: 2 × 7
#>   id_col       dupe_count dataset response_id submission_id analysis_id split_id
#>   <chr>             <int> <chr>   <chr>               <dbl>       <dbl>    <dbl>
#> 1 Byrock-1-8-1          2 blue t… R_3qfD5ZHH…             1           8        1
#> 2 Byrock-1-8-1          2 eucaly… R_3HzSBqQT…             1           8        1
Yes, different response_id for the same id_col for analyses of different datasets.
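Once we decide which of the two analyses keeps 1-8-1, the recode itself could look roughly like the sketch below. Everything here is hypothetical: the toy rows, the response_id values and especially new_analysis_id are placeholders, since which ids to assign is still to be decided. It assumes id_col is built as TeamIdentifier-submission_id-analysis_id-split_id, as the ids above suggest.

```r
library(dplyr)

# Toy rows (hypothetical values) mimicking two analyses sharing one id_col
toy <- tibble(
  TeamIdentifier = "TeamA",
  response_id    = c("R_bbb", "R_ccc"),
  submission_id  = c(1, 1),
  analysis_id    = c(8, 8),
  split_id       = c(1, 1),
  id_col         = c("TeamA-1-8-1", "TeamA-1-8-1")
)

# Placeholder only: the real replacement value is not known yet
new_analysis_id <- 11

recoded <- toy %>%
  mutate(
    # Recode only the row belonging to the chosen response_id
    analysis_id = if_else(response_id == "R_ccc", new_analysis_id, analysis_id),
    # Rebuild id_col from its components so the columns stay in sync
    id_col = paste(TeamIdentifier, submission_id, analysis_id, split_id, sep = "-")
  )

# TRUE when no id_col is duplicated after recoding
anyDuplicated(recoded$id_col) == 0
```

Rebuilding id_col from the component columns, rather than editing the string by hand, avoids the id and its parts drifting apart again.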
Let’s check the raw data file. Here’s the reprex output I ran over at ManyEcoEvo:
*Local .Rprofile detected at /Users/elliotgould/Documents/GitHub/ManyEcoEvo/.Rprofile*
library(targets)
library(tidyverse)
There are no extra prediction file submissions for these analyses, so that’s not a problem.
tar_read(list_of_new_prediction_files) %>%
  filter(response_id == "R_3qfD5ZHHdBbTgk3" | response_id == "R_3HzSBqQTAmJJ9ye") %>%
  select(dataset, ends_with("_id"), csv_number)
#> # A tibble: 0 × 6
#> # ℹ 6 variables: dataset <chr>, response_id <chr>, submission_id <dbl>,
#> # analysis_id <dbl>, split_id <dbl>, csv_number <dbl>
Let’s check the underlying master_data:
readr::read_csv("data-raw/anonymised_data/master_data.csv") %>%
  filter(TeamIdentifier == "Byrock") %>%
  select(id_col, dataset, all_of(ends_with("_id"))) %>%
  knitr::kable()
#> Rows: 302 Columns: 154
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (135): response_id, id_col, contrast, transformation, link_function_repo...
#> dbl (16): submission_id, analysis_id, split_id, beta_estimate, adjusted_df,...
#> lgl (3): Extra-pair_dad_ring, rear_Cs_out, rear_Cs_in
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| id_col | dataset | response_id | submission_id | analysis_id | split_id | hatch_nest_breed_ID | rear_nest_breed_ID |
|:---|:---|:---|---:|---:|---:|:---|:---|
| Byrock-1-3-1 | eucalyptus | R_23UKvhBc7D608VO | 1 | 3 | 1 | NA | NA |
| Byrock-5-7-1 | eucalyptus | R_23UKvhBc7D608VO | 5 | 7 | 1 | NA | NA |
| Byrock-4-6-1 | eucalyptus | R_23UKvhBc7D608VO | 4 | 6 | 1 | NA | NA |
| Byrock-2-4-1 | eucalyptus | R_23UKvhBc7D608VO | 2 | 4 | 1 | NA | NA |
| Byrock-3-5-1 | eucalyptus | R_23UKvhBc7D608VO | 3 | 5 | 1 | NA | NA |
| Byrock-3-10-1 | eucalyptus | R_3HzSBqQTAmJJ9ye | 3 | 10 | 1 | NA | NA |
| Byrock-1-8-1 | eucalyptus | R_3HzSBqQTAmJJ9ye | 1 | 8 | 1 | NA | NA |
| Byrock-2-9-1 | eucalyptus | R_3HzSBqQTAmJJ9ye | 2 | 9 | 1 | NA | NA |
| Byrock-1-1-1 | blue tit | R_3iKJrflQwwxsps0 | 1 | 1 | 1 | NA | rear_nest_breed_ID |
| Byrock-2-2-1 | blue tit | R_3iKJrflQwwxsps0 | 2 | 2 | 1 | NA | rear_nest_breed_ID |
| Byrock-1-8-1 | blue tit | R_3qfD5ZHHdBbTgk3 | 1 | 8 | 1 | NA | rear_nest_breed_ID |
Yes, this must be the source of the issue. Two 1-8-1 entries.
Created on 2024-06-18 with reprex v2.1.0