Giter Site home page Giter Site logo

cancerprof's People

Contributors

realbp avatar seankross avatar

Stargazers

 avatar  avatar

cancerprof's Issues

When to skip certain tests

My instinct is that there are certain situations where you should skip the tests that send a request to the API. Likely these tests should be skipped on CRAN. @sckott What do you think of that assessment?

is Louisiana data missing?

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask in the Fred Hutch Data Slack https://hutchdatascience.org/joinslack/ or email us at [email protected].

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex start by reading https://www.tidyverse.org/help/#reprex

Brief description of the problem:

> incidence_cancer("LA", "county", "melanoma of the skin", "all races (includes hispanic)", "females", "all ages", "all stages", "latest 5 year average")
Error in `sym()`:
! Can't convert a character vector to a symbol.

Update documentation for sex argument

Sex is listed a 'male', 'female' or 'both sexes' in the documentation. Input required is plural - 'males' or 'females'

incidence_cancer(area='NC', areatype="hsa", cancer = "Melanoma of the Skin", race = "all races (includes hispanic)", sex = "male", age = "all ages", stage = "all stages", year = "latest 5 year average")

'Error in handle_sex(sex) :
Invalid sex input, please check the documentation for valid inputs'

installation issue for R 4.4.0

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask in the Fred Hutch Data Slack https://hutchdatascience.org/joinslack/ or email us at [email protected].

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex start by reading https://www.tidyverse.org/help/#reprex

Brief description of the problem:

The dependency package libcurl4-openssl-dev is not available for R 4.4.0.

> pak::pak("getwilds/cancerprof")
                                                                            
→ Will install 1 package.
→ The package (0 B) is cached.
+ cancerprof   0.1.0 [bld][cmp] (GitHub: 23dbd98)
✖ Missing 1 system package. You'll probably need to install it manually:
+ libcurl4-openssl-dev  - curl
ℹ No downloads are needed, 1 pkg is cached
ℹ Building cancerprof 0.1.0
✖ Failed to build cancerprof 0.1.0 (11.2s)                            
Error:                                                                
! error in pak subprocess
Caused by error in `stop_task_build(state, worker)`:
! Failed to build source package cancerprof.
Type .Last.error to see the more details.
> install.packages("libcurl4-openssl-dev")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.4’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘libcurl4-openssl-dev’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
> 

examples and CRAN

On CRAN submission and their daily checks if examples fail you'll hear about it and have to fix it typically within 2 weeks. Just a note to think about WHEN examples are run

There's a number of solutions.

And it's not just whether they fail but if how long they take. I don't know exact numbers, but CRAN will complain if they take too long

Coerce column classes?

Perhaps this has already been litigated, but wouldn't it make sense to coerce columns of data.frame outputs to their proper types? e.g.,

x <- risk_women_health(women_health = "mammogram in past 2 years, ages 50-74",
                     race = "all races (includes hispanic)",
                     datatype = "direct estimates")
str(x)
#> 'data.frame':	52 obs. of  6 variables:
#>  $ State                : chr  "Wyoming" "Idaho" "Alaska" "Oklahoma" ...
#>  $ FIPS                 : chr  "56000" "16000" "02900" "40000" ...
#>  $ Percent              : chr  "65.2 " "68.5 " "68.8 " "69.3 " ...
#>  $ Lower_95%_CI         : chr  "61.8" "65.6" "65.0" "66.3" ...
#>  $ Upper_95%_CI         : chr  " 68.5" " 71.4" " 72.5" " 72.3" ...
#>  $ Number_of_Respondents: chr  "706" "933" "873" "882" ...

where I'd think the last 4 columns at least could safely be converted to numeric/integer

I don't see this as super important, but would make it easier for users once they have the data so they don't have to do this themselves

Reduce duplication

e..g, the 4 process_* functions. they share a lot of the same code, and that common code could be factored out. but I haven't looked at the 4 fxns in detail so it's possible it's not worth it

Selecting a state/territory abbreviation for area and areatype as 'state' pulls both DC and Puerto Rico information

When putting in a state/territory abbreviation code and selecting areatype='state' data from District of Columbia and Puerto Rico are returned.

Example:
incidence_cancer(area="NC", areatype="state", cancer = "Melanoma of the Skin", race = "all races (includes hispanic)", sex = "females", age = "all ages", stage = "all stages", year = "latest 5 year average")

I would expect this to throw an error instead of returning DC and Puerto Rico information. Unless features are upgraded to allow filtering of USA list- i.e. area = c('WA', 'NC', 'PR'), it would be helpful to specify in documentation that when specifying a state/territory in area, that areatype must be 'hsa' or 'county'

Use vcr

I'm not super familiar with this package, so I'd make sure to check whether vcr is appropriate first, and if so use it where it makes sense, and not use it, or use other testing approaches where it doesn't make sense

Historical Trends Data

Do we have a function to extract "Historical Trends" data? For ex, https://statecancerprofiles.cancer.gov/historicaltrend/index.php?0&9953&999&7599&001&001&28&0&0&0&1&1&1&1#results

The URL to "Export Data" on the "Historical Trends" page is not pretty, to say the least: https://statecancerprofiles.cancer.gov/historicaltrend/data.php/historicaltrend.csv?0&9953&999&7599&001&001&00&0&0&0&2&0&1&1&6.

If we don't have a function already, I'm not sure how to use httr2 to grab this data. I may have to scrape this data using rvest.

Request for default values for standard demographics

For race, sex, age, setting default values might make this package quicker and easier to use for those new to the data ('All Races (includes Hispanic)', 'both sexes', 'all ages'). This would be counter to the OS of the website as it requires individual selection for all these values but might make the package more user-friendly.

Depending on other users' requests/preferences could also consider defaulting for cancer options ('all cancer sites') and stage ('all stages').

Needs linting

See: https://lintr.r-lib.org/

  • Clean up the package via iterative lints: lintr::lint_package()
  • Set up linting via gh actions: usethis::use_github_action("lint")

Also @sckott may have tips. (I have only gotten serious about lintr since Scott joined DaSL)

Organize functions into fewer files

Currently, this package's R/ directory contains one file for each function. So, the file demo-crowding.R contains demo_crowding(), the file demo-education.R contains demo_education(), and so on.

According to the "R code" chapter in R Packages, this extreme is bad. I suggest we have a single .R file contain a family of related functions. For ex, demo.R would contain all the demo_*() functions, handle.R would contain all the handle_*() functions, and so on.

Metadata Parsing proposal

the metadata for each of the 4 categories in state cancer profiles is different so I think having different ways to handle the parse of the metadata would be needed. Using a list of strings and having a key and value might be the most useful for users to have control of the metadata.

demo_metadata_list <- list(
  input = c(
    "Demographic Data Report for Washington by County",
    "Education: Less than 9th grade",
    "Population Ages 25+",
    "All Races (includes Hispanic), Both Sexes",
    "2017-2021 American Community Survey 5-Year Data"
  ),
  sortedby = "Value",
  createdby = "statecancerprofiles.cancer.gov on 01/10/2024 6:16 pm.",
  data_sources = "Demographic data provided by the Census Bureau (http://www.census.gov/) & the American Community Survey (http://www.census.gov/acs/www/).",
  data_dictionary = "For more information about Education: Less than 9th grade, see the dictionary at http://statecancerprofiles.cancer.gov/dictionary.php#education.",
  data_limitations = "Data for United States does not include Puerto Rico."
)

Incidence and Mortality metadata is much more complex.

incd_metadata_list <- list(
  input = c("Incidence Rate Report for Washington by County",
            "Bladder (All Stages^), 2016-2020",
            "All Races (includes Hispanic), Both Sexes, Ages <50")
  sortedby = "Rate",
  createdBy = "Created by statecancerprofiles.cancer.gov on 02/23/2024 5:25 pm."
  trend = c(
    "Rising when 95% confidence interval of average annual percent change is above 0.",
    "Stable when 95% confidence interval of average annual percent change includes 0.",
    "Falling when 95% confidence interval of average annual percent change is below 0."
  ),
  rate_note = c(
    "Incidence rates (cases per 100,000 population per year) are age-adjusted to the 2000 US standard population.",
    "Rates are for invasive cancer only (except for bladder cancer which is invasive and in situ) or unless otherwise specified."
  ),
  trend_note = c(
    "Incidence data come from different sources.",
    "AAPCs are calculated by the Joinpoint Regression Program."
  ),
  data_sources = c(
    "National Program of Cancer Registries [https://www.cdc.gov/cancer/npcr/index.htm]",
    "Surveillance, Epidemiology, and End Results [http://seer.cancer.gov]",
    "SEER*Stat Database - United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Based on the 2022 submission.",
    "SEER November 2022 submission."
  ),
  data_dictionary = c(
    "Incidence rates (cases per 100,000 population per year) are age-adjusted to the 2000 US standard population.",
    "AAPCs are calculated by the Joinpoint Regression Program and are based on APCs."
  ),
  data_limitations = c(
    "Data has been suppressed to ensure confidentiality and stability of rate estimates.",
    "Data for the United States does not include data from Nevada.",
    "Data for the United States does not include Puerto Rico."
  )
)

Order of parameters

Typically - at least what I'm familiar with - required parameters go first, with optional parameters (including those with default values) last.

e.g., in demo_crowding you have

function(area, areatype, 
                          crowding = "household with >1 person per room", 
                          race)

whereas i'd want crowding to go last

function(area, areatype, race,
                          crowding = "household with >1 person per room")
                          

do other folks have other ideas?

vignettes and CRAN

On CRAN submission and their daily checks if vignettes fail to build you'll hear about it and have to fix it typically within 2 weeks.

There's probably a number of different solutions, but from my perspective you shouldn't submit a package to CRAN with vignettes that have to build. That is, prebuild them.

A solution I use with all my pkgs is

A small caveat with this is that your vignettes are not run on cran checks - and not built on normal R CMD BUILD, etc. But the alternative is that vignette builds can fail on CRAN and even if don't fail can take a long time, which they may also contact you about

`create_request()` has no return value

Currently, this is what create_request() looks like:

create_request <- function(topic) {
  url = "https://statecancerprofiles.cancer.gov/"
  url_end = "/index.php"
  
  url = paste0(url, topic, url_end)
  
  req <- request(url)
}

I was confused by this syntax as I'm used to isolating the value I want to return on the last line of a function. This SO post revealed that an R function returns the value of the last expression evaluated, which in our case would be request(url). However, since it is assigned to req, this sets the visibility to FALSE, so it doesn't display.

Unless you had specific reasons, I would get rid of the assignment and simply have request(url) as the last line of the function. This way, in the future, when we are debugging an error and running this function, we'll see the request object.

For ex,

> create_request("risk")
<httr2_request>
GET https://statecancerprofiles.cancer.gov/risk/index.php
Body: empty

Response URLs

Having read through the source code for demo_crowding() and demo_education() in the dev branch, I noticed that we assign a variable for resp$url:

resp_url <- resp$url

resp_url <- resp$url

I don't think this is necessary since the response URL shouldn't be different from the request URL (unless there were redirects, which we can safely assume there won't be).

We should be getting the same url as resp$url from req$url, so I don't think we need to assign a new variable, resp_url.

What do you think?

Questions about `process_response()`

Questions that arose from my genuine curiosity:

  1. On line 33, I was wondering why you used textConnection() and read.csv() to get data frame from resp_lines. I thought that performing some string gymnastics could have reformatted resp_lines into a data frame, but curious as to the logic behind this line.
  2. On lines 39 and 43. I was wondering why you used symbols instead of just filtering the column as-is.

Thanks!

Inconsistency in processing metadata

Today, I went through the source code for the demo_*() family of functions, and noticed an inconsistency. demo_crowding() and demo_education() closes with a line of code to process the metadata with process_metadata():

process_metadata(resp, "demographics", resp_url)

The other demographics functions do not process the metadata at the end. Is there a reason why they don't?

Same point holds for the risk_*() family of functions.

Error in `as_tibble()`

While reviewing #91, I ran into an error when running process_resp():

Error in as_tibble(.) : could not find function "as_tibble"

I think you just need to add @importFrom tibble as_tibble in the roxygen documentation for process_resp().

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.