epiverse-trace / cfr

R package to estimate disease severity and under-reporting in real-time, accounting for reporting delays in epidemic time-series

Home Page: https://epiverse-trace.github.io/cfr/

License: Other

r r-package case-fatality-rate epidemiology health-outcomes epidemic-modelling outbreak-analysis epiverse sdg-3

cfr's People

Contributors

actions-user, adamkucharski, avallecam, bisaloo, joshwlambert, pratikunterwegs, thimotei


cfr's Issues

Estimation of under-ascertained cases from deaths

Add functionality that can estimate cases from reported deaths and a known CFR. A starting point could be the estimator in the scale_cfr function (as used in early versions of the CMMID COVID reports). Follow-up functionality could account for reporting that varies over time, as implemented by Russell et al., 2020. It might be useful to compare with similar functionality in packages like EpiNow2 and coarseDataTools, if such methods are available and the data formats are the same (e.g. cases and deaths over time).
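As a rough illustration of the starting point described above, a scale_cfr-style back-calculation might look like the following sketch. Everything here is hypothetical (function names, gamma parameters, the crude inversion), not the package's API:

```r
# Sketch: back-calculate expected cases from reported deaths, a known CFR,
# and an onset-to-death delay distribution. Gamma parameters are placeholders.
delay_density <- function(x) stats::dgamma(x, shape = 2.4, rate = 0.3)

estimate_cases_from_deaths <- function(deaths, cfr, delay_density,
                                       max_delay = 30) {
  pmf <- delay_density(seq_len(max_delay))
  pmf <- pmf / sum(pmf) # discretise and normalise
  n <- length(deaths)
  cases <- numeric(n)
  # Deaths on day t + j trace back to cases on day t with probability pmf[j];
  # scaling by 1 / CFR converts expected deaths into expected cases.
  for (t in seq_len(n)) {
    for (j in seq_len(min(max_delay, n - t))) {
      cases[t] <- cases[t] + deaths[t + j] * pmf[j] / cfr
    }
  }
  cases
}
```

A time-varying extension (as in Russell et al., 2020) would replace the fixed cfr with a reporting fraction estimated per time window.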

Comparison with other methods

In a vignette, it would be useful to show comparisons with estimation in EpiNow2 and coarseDataTools, noting any differences in required data (e.g. cases and deaths vs recoveries and deaths). Comparisons could also be built into the episoap template for CFR estimation, along with calls to epiparameter and other relevant packages.

Fix dependencies on {epiparameter}

This issue is to request that {datadelay} be fixed to take into account recent changes to {epiparameter}, which are causing all workflows on main to fail. The last check on the PR now merged into main was before these changes, which meant that the issues were not picked up at the time.

The following functions depend on {epiparameter} indirectly, as they expect an argument delay_pmf, which was intended to be a PMF extracted from an epidist object:

  1. known_outcomes(),
  2. static_cfr(),
  3. rolling_cfr()

The expected solution would be for these functions to instead accept an epidist object as an argument; the argument would also ideally be renamed to epi_dist or similar ("epi_dist" is preferred as it also indicates the class of the input).

Furthermore, tests for these functions also depend on the availability of an Ebola virus disease onset-to-death delay distribution in {epiparameter}. The eventual solution would be the inclusion of this distribution in {epiparameter}, but an adequate interim solution would be to use a manually defined epidist in the tests; distribution parameters could be taken from doi.org/10.1016/S0140-6736(18)31387-4.

Rename package

This issue is to suggest a change of name for the package to match the functionality included in it.

Options include:

  1. {cfr}: The name {cfr} is available, and this package could add more CFR estimation methods such as that in https://pubmed.ncbi.nlm.nih.gov/16076827/
  2. {casefatality}: Also available; perhaps a more descriptive name for people in the field.
  3. Please suggest others.

Add data for examples to package

This issue is to request that the Covid data used in the examples be included in the package itself, rather than downloaded via an API or taken from a data package. The data should also be used in the package vignettes and function examples.

This stems from the issue that these packages are either not on CRAN, not actively developed, or both. This is the case for {covidregionaldata} and {owidR}: {owidR} was on CRAN but was removed as of 8th August 2023. This reopens the issue raised in #61.

Refactor estimate_severity()

This issue is to report a possible issue with estimate_severity().

Issue: The function description says that the severity is calculated using the cases with known outcomes, which should usually be more than the reported deaths (as some cases will eventually result in deaths). But the function uses total deaths, not total known outcomes, in most cases - am I missing something here?

There are also instances of multiplying u_t * total_cases, but u_t = total_outcomes/total_cases - does this make sense?

Odd results in ascertainment plots in `estimate_ascertainment.Rmd` vignette

Looking through the processed version of the updated vignette, something looks off with the ascertainment plots – they're showing 100% for many countries, which isn't plausible (and I don't think matches the previous version of this plot?)

[figure: ascertainment plots from the vignette]

In the Rmd code, the ascertainment estimation seems to work OK if "United Kingdom" is replaced with another country in the single-country example at the top, so it seems like an issue in the nesting step?

Also, I noticed estimate_ascertainment() is returning an error about a missing get_default_burn_in() function (although estimate_ascertainment() still outputs a value); get_default_burn_in() is in man/ but not in R/, so it was possibly removed in an earlier commit?

Error message for missing dates (`estimate_static`)

I was getting the following error message when using datadelay's estimate_static function:

Error in data.frame(severity_me = severity_me, severity_lo = severity_lims[[1]], : arguments imply differing number of rows: 0, 1

Looking into it, @adamkucharski helped me figure out that this was because the cases-and-deaths dataset I was using was accidentally missing some death dates (as a result of a previous data-cleaning step), and the function requires data on each day, without skipping any dates.

From a user perspective, I believe it would be useful to state this requirement more clearly in the documentation, so that users make sure their dataset is in the correct format, and, most importantly, to provide a more informative error message so that users can easily fix the mistake.
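Until such an error message exists, a pre-processing check along these lines could catch the problem. This is a base-R sketch, and the column names (date, cases, deaths) are an assumption:

```r
# Detect skipped dates in a cases/deaths time series and fill them with
# zero counts, warning the user so silent gaps do not go unnoticed.
complete_dates <- function(data) {
  all_dates <- seq(min(data$date), max(data$date), by = "day")
  gaps <- all_dates[!all_dates %in% data$date]
  if (length(gaps) > 0) {
    warning("Filling ", length(gaps), " missing dates with zero counts")
    data <- rbind(data, data.frame(date = gaps, cases = 0, deaths = 0))
  }
  data[order(data$date), ]
}
```

Whether gaps should be zero-filled or rejected outright is a design decision for the package; the warning at least surfaces the mismatch that produced the cryptic data.frame error above.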

Standardise data formatting

If implementing and comparing different estimation methods, we need to ensure input and output data formats are consistent, e.g. vectors of dates, cases, and deaths. Tests are also needed to check that the format is correct (e.g. date and numeric vectors).
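For example, a shared input check could look like the following (purely illustrative; the actual interface is yet to be decided):

```r
# Validate the common date / cases / deaths input format shared across
# estimation methods: correct types and equal lengths.
assert_cfr_input <- function(date, cases, deaths) {
  stopifnot(
    inherits(date, "Date"),
    is.numeric(cases), is.numeric(deaths),
    length(date) == length(cases),
    length(cases) == length(deaths)
  )
  invisible(TRUE)
}
```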

Generalise CFR plotting

The proposed plotting command assumes two CFRs being plotted together. We might want to make the functionality more general in future (e.g. what if we want to compare three CFR estimation methods?).

Originally posted by @adamkucharski in #11 (comment)

Smoothing option for CFR

Daily case data often exhibit cyclic variation (e.g. day-of-week effects), so it is worth adding an option for smoothing, either in the delay functions or as a pre-processing step.
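As a sketch of the pre-processing option, a centred rolling mean damps day-of-week effects; the 7-day window and the base-R implementation are assumptions, not a design decision:

```r
# Centred rolling-mean smoother for daily counts; returns NA at the
# series edges where the full window is unavailable.
smooth_counts <- function(x, window = 7) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}
```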

Add package logo

This issue is to request that the package logo be added as an SVG file. The logo and name are still liable to change based on PR #47.

Simulation recovery

It would be useful to have some simple simulation-recovery functionality, for both a small stochastic outbreak and a large epidemic, tested at different points (e.g. early rising stage, post-peak, etc.).

Update Readme

This issue is to request that the Readme should be updated to reflect new package functionality, as the current Readme is out of date. This links to issue #34 and can be combined into the same PR.

Test estimate_time_varying()

This issue is to request that the estimate_time_varying() function should be tested, with reference to the potentially anomalous behaviour reported here:

Worth double-checking this function (maybe add a test?) as it's currently returning CFR = 100% early on. The estimated CFR may well have been very high, given that most cases detected were severe, but it may instead be an issue with the burn-in period used rather than with reporting.

Originally posted by @adamkucharski in #23 (comment)

Use CSL JSON references

This issue is to request that the references stored as a BibTeX file be stored instead as a CSL JSON file, so as not to affect the package language statistics.

Add vignette describing user options

There are a few user options that are likely to be useful for estimation/comparison:

  • Fixed CFR (i.e. 'total deaths'/'total expected cases with known outcome') vs time-varying CFR (i.e. 'deaths on day X'/'expected cases with known outcome on day X')

  • Whole timeseries (i.e. 'total deaths'/'expected cases with known outcome') vs expanding window (i.e. 'total deaths up to day X'/'total expected cases with known outcome up to day X')

  • Small numbers of events (e.g. Ebola 1976) vs large numbers of events (e.g. COVID)

  • Raw vs smoothed timeseries (i.e. effectively implementing an observation model if reporting is cyclical or noisy). Possibly something to implement in a separate package, as it would be useful across packages?

  • Efficiency (e.g. parallelisation) across countries/time periods
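The first two options in the list above can be sketched with the usual delay-adjusted denominator. This is illustrative code, not the package's internals; it assumes the PMF starts at a delay of zero days and is at least as long as the case series:

```r
# Expected number of cases with known outcome on day t, convolving the
# case series with the onset-to-outcome delay PMF.
known_outcomes_expected <- function(cases, pmf) {
  n <- length(cases)
  sapply(seq_len(n), function(t) sum(cases[seq_len(t)] * rev(pmf[seq_len(t)])))
}

# Fixed (whole-timeseries) CFR: total deaths over total expected known outcomes.
fixed_cfr <- function(cases, deaths, pmf) {
  sum(deaths) / sum(known_outcomes_expected(cases, pmf))
}

# Time-varying CFR: deaths on day t over expected known outcomes on day t.
varying_cfr <- function(cases, deaths, pmf) {
  deaths / known_outcomes_expected(cases, pmf)
}
```

The expanding-window variant would apply fixed_cfr to the series truncated at each day X in turn.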

Allow tracking pkgdown/

This issue is to flag that we may want to add the pkgdown/ directory to tracking in the future.

This folder can contain important files we want to commit. Do you have any issues that lead you to add it here?

Originally posted by @Bisaloo in #54 (comment)

Correct grouping variables in estimate_severity

This issue is to request correction of the "group_by" argument in estimate_severity() and upstream wrapper functions. This argument is ambiguous and doesn't behave as users might expect from the better-known dplyr::group_by().

{epiparameter} integration

I wonder if user-facing integration with {epiparameter} should be achieved in a similar way to the work in superspreading. There, I believe, @joshwlambert has moved towards the epidist object being an optional argument. I think this is a nice approach, as it does not force the use of epidist objects on the user.

Note this is not a question of dependencies ({cfr} would still need to import {epiparameter} for internal use) but of API design and consistency across the various packages.

This likely warrants a wider discussion but raising here initially.

Add lifecycle badge

The package is currently highly unstable, which should be communicated with a lifecycle badge.

Adding forecasting functionality

Once we have an estimate of the CFR, which with current estimation methods will typically run up to the most recent death, it would be possible to generate a forecast forward in time based on the estimated CFR, the onset-to-outcome delay, and recent case numbers.
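A minimal sketch of such a projection follows. Everything here (function name, arguments, the convention that pmf[1] is the probability of a zero-day delay) is illustrative, and pmf must cover delays up to the series length plus the horizon:

```r
# Project expected deaths h days ahead from past cases, an estimated CFR,
# and an onset-to-death delay PMF.
forecast_deaths <- function(cases, cfr, pmf, horizon = 7) {
  n <- length(cases)
  sapply(seq_len(horizon), function(h) {
    s <- seq_len(n) # onset days contributing to deaths on day n + h
    cfr * sum(cases[s] * pmf[n + h - s + 1])
  })
}
```

A fuller implementation would also need a forecast of cases beyond day n, since those contribute to deaths later in the horizon.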

Remove plotting functions

This issue is to request removal of the plotting functions in {datadelay}. The Epiverse-TRACE philosophy has moved to not including plotting functions in packages, as they add substantial development overhead, can be dependency heavy, and because good advice for plotting epidemiological data exists in resources such as the Epi R Handbook.

Avoiding use of time varying estimation for small datasets

For datasets with relatively small numbers of cases/deaths (like the Ebola 1976 example in the README), the estimate_time_varying function can be unstable, as it is distributing expected timings based on a small number of discrete events, rather than estimating the trend over events occurring daily (like the COVID timeseries).

Therefore, we should limit usage to estimate_static for these smaller datasets, e.g. in the README, as an earlier iteration of this package did: https://github.com/adamkucharski/ebola-cfr/blob/main/scripts/main_script.R

If we want to show a figure, we could show how the static CFR calculation changes as more and more data are included. This was in the above script (CFR_figure.pdf) and in an earlier version of this package as a plot, but may have been deprecated to prevent confusion between using an expanding time window of data to fit a static CFR (i.e. this PDF) and fitting a time-varying CFR to a fixed time window of data (i.e. estimate_time_varying). However, given that it nicely illustrates the difference between the naive and time-adjusted static methods, maybe we should include it in the README again.

Input data from `incidence2`

datadelay's functions to estimate CFR require a specific data frame format with date, onset, and death columns. When using incidence2 to go from a linelist to daily case/death counts, these counts can be obtained in one step, but the result is a long-format data frame (i.e. the variables "onset_date" and "death" appear as rows rather than columns), and the table then has to be pivoted before it can be used as input for datadelay.
Alternatively, users could extract the case and death counts separately using incidence2 and merge them into a new dataset, or attempt to do this manually without using the package.
In any of these cases, I think it will be quite tedious for users to add this many lines of code to their scripts for such a predictable task.
I believe it would be very useful to add a function that produces a data frame in the right format in one step. I'm unsure which package this should belong to, whether incidence2 or datadelay, but after a conversation with @Bisaloo I'm raising it here for wider discussion.
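For reference, the manual pivot currently looks something like this in base R. The column names below (count_variable, count) are illustrative, not incidence2's exact output:

```r
# Long-format counts as produced by an incidence2-style workflow:
long <- data.frame(
  date = rep(as.Date("2023-01-01") + 0:2, each = 2),
  count_variable = rep(c("onset_date", "death"), times = 3),
  count = c(5, 1, 7, 0, 4, 2)
)

# Pivot to one row per date with separate cases and deaths columns:
wide <- reshape(long, idvar = "date", timevar = "count_variable",
                direction = "wide")
names(wide) <- c("date", "cases", "deaths")
```

A helper function wrapping this (wherever it ends up living) would save users from repeating the reshape-and-rename dance in every script.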

Grouping for <incidence2> objects

This issue is about how prepare_data() treats grouped <incidence2> objects.

The current behaviour is to error when grouped <incidence2> objects are passed, and to advise users to call incidence2::regroup() on the object before passing it.

Alternatives include respecting the grouping structure, but this could require taking on a data science dependency such as {data.table} or {dplyr} + {tidyr}, although base-R-only options could also be implemented.

Can we open another issue for the incidence2 grouping discussion please to make sure this stays on our radar? I think it's an important point.

Originally posted by @Bisaloo in #39 (comment)

Re-allow passing delay function as alternative to epidist

This issue is based on a suggestion by @TimTaylor in issue #59 to allow users to pass a custom delay function as an alternative to providing an <epidist> to the epi_dist argument. This was the implementation from PR #11 until PR #22.

The proposed solution is to allow passing a delay_function argument (renamed from delay_pmf) to which users would pass a function that wraps the PMF/PDF function for a distribution; e.g., function(x) stats::dgamma(x, shape, rate), or function(x) stats::density(distribution, at = x) for <distributional> or similar objects.
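Concretely, the proposal would let a user write something like the following. The delay_function name is from this issue; how an estimation function would discretise it is an assumption for illustration:

```r
# User-supplied wrapper around any density/mass function, per the proposal:
delay_function <- function(x) stats::dgamma(x, shape = 2.4, rate = 0.3)

# Internally, an estimation function could then discretise onto days 0..30
# and normalise to obtain a PMF:
days <- 0:30
pmf <- delay_function(days)
pmf <- pmf / sum(pmf)
```

This keeps {epiparameter} optional: an <epidist> input would simply be converted to such a function internally.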

Pass epidist object instead of delay_pmf

Functions proposed in #11 have arguments that accept a probability mass function of the onset-to-death delay. This function is typically taken from epiparameter::epidist objects.

It is not currently possible to check that the user has correctly passed an onset-to-death distribution rather than some other distribution associated with that pathogen. This is because the distribution-type metadata present in epiparameter::epidist objects is lost when subsetting for $pmf.

This could be solved by passing the full epiparameter::epidist object instead, allowing better input checking.

Remove basic plotting sections in vignettes

I'm wondering about these [plotting] sections that plot the data prior to applying the functions. It's nice, but maybe the vignettes should go more directly to the point of demonstrating the package's functionality? Plotting that doesn't showcase the outputs of the functions might fit better in a tutorial or on a learning platform like Applied Epi.

Originally posted by @CarmenTamayo in #55 (comment)

Use covid data from OWID rather than {covidregionaldata}

This issue is to request that the dependency on {covidregionaldata} data be swapped out in favour of Covid-19 data from Our World in Data, via the {owidR} package. {owidR} is on CRAN, removing one blocking dependency (see the relevant discussion linked below).

Yep, could import from any alternative COVID data source that has cases and deaths over a sufficiently long period to expect a change from accumulation of immunity (e.g. timeseries from OWID? https://ourworldindata.org/coronavirus). {covidregionaldata} was used as an illustrative example in early versions of the package, so there's no reason it has to be the go-to dependency.

Originally posted by @adamkucharski in epiverse-trace/epiverse-trace.github.io#85 (reply in thread)

Fix vector length mismatch in estimate_severity()

In estimate_severity(), the vector u_t is multiplied with the vector pprange. The warnings on CI checks in #23 result from these being of unequal length: u_t for the Ebola data (ebola1976) has 37 values, while pprange has 1000. Should they be the same length? Should u_t be interpolated to a length of 1000, maybe?

Either way, this vector length mismatch throws warnings that should be fixed.
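If equal lengths are indeed required, linear interpolation onto the longer grid is one possible fix (a sketch of the idea, not a design decision):

```r
# Interpolate a 37-value u_t onto a 1000-point grid matching pprange's
# length, preserving the endpoint values exactly.
set.seed(1)
u_t <- runif(37) # stand-in for the real u_t values
u_interp <- stats::approx(x = seq_along(u_t), y = u_t,
                          xout = seq(1, 37, length.out = 1000))$y
```

Whether interpolation is statistically appropriate here depends on what u_t represents, so this should be settled before silencing the warnings.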

Deconvolution vs backwards sampling

Nice to see this project kicking off. I see that the MVP currently uses backwards sampling, which has a range of bias issues.

Here is some code for a matrix approach to convolution which should generally be more efficient as well as a deterministic deconvolution method that works by solving for the matrix inverse. Note this example is tuned for ONS prevalence as we were aiming to recreate the method they use to get incidence (which they confirmed we had done via email).

https://gist.github.com/seabbs/fb1bc9c79c3dd7117f9314cb97e71615

(Note this code comes with no licence permitting free reuse etc. without attribution.)
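The core of the matrix approach mentioned above can be sketched as follows. This is written independently of the linked gist, under the assumption of a lower-triangular delay matrix with a non-zero diagonal (i.e. some probability mass at a zero-day delay, so the system is invertible):

```r
set.seed(42)
n <- 20
# Discretised delay PMF with non-zero mass at delay 0; the exponential
# shape (shape = 1) is chosen purely so the matrix is invertible.
pmf <- dgamma(0:(n - 1), shape = 1, rate = 0.5)
pmf <- pmf / sum(pmf)

# Lower-triangular convolution matrix D, so that deaths = D %*% cases:
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(i)) D[i, j] <- pmf[i - j + 1]
}

cases <- rpois(n, lambda = 30)
deaths_expected <- as.numeric(D %*% cases)

# Deterministic deconvolution: solve the triangular linear system
# instead of sampling backwards in time.
cases_recovered <- solve(D, deaths_expected)
```

Unlike backwards sampling, the solve step recovers the case series exactly in this noise-free setting; with noisy data, regularisation would be needed.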

Release cfr 0.1.0

First release:

Prepare for release:

  • git pull
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE) : OK except 1. New package, 2. epiparameter dependency, 3. LaTeX error on R 4.2.1
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Reduce or remove data download

This issue is to request that the package examples and vignettes reduce the amount of data they download, as downloads cause delays in local testing and in rendering the package documentation. This could also be a barrier for users working through the vignettes without reliable internet (currently the case in London).

Most data is downloaded via {covidregionaldata}; alternatives to downloads, such as making the data available from the package itself (as with the 1976 Ebola data), should be considered.

Rethink `format_output()`

Is format_output() really necessary? Could we not return to the earlier implementation of the severity estimate as a named vector with three values? That would handle the pretty-printing issue to some extent.

Originally posted by @pratikunterwegs in #23 (comment)

Rethink plot_epiparameter_distribution()

I think {epiparameter} has a plot() method for epidist objects, so I wonder whether plot_epiparameter_distribution() can be removed. If it is a distinct method that achieves something quite different, it might be worth formally making it an S3 method for epidist objects.

Originally posted by @pratikunterwegs in #23 (comment)
