getwilds / cancerprof Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 3.0 7.15 MB

API Client for State Cancer Profiles

Home Page: http://getwilds.org/cancerprof/

License: Other

R 100.00%

wilds-r

cancerprof's People

Contributors

Stargazers

Forkers

howardbaek mtiryakiuw alman1

cancerprof's Issues

When to skip certain tests

My instinct is that there are certain situations where you should skip the tests that send a request to the API. Likely these tests should be skipped on CRAN. @sckott What do you think of that assessment?

Handling http errors

I don't see any explicit HTTP error handling in the package, but maybe it's not needed. I just dn't know enough about the website to know if it's needed. Do you ever end up with 400 or above series HTTP status codes? If so, should handle that kind of result in probably here https://github.com/getwilds/cancerprof/blob/main/R/process-response.R

is Louisiana data missing?

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask in the Fred Hutch Data Slack https://hutchdatascience.org/joinslack/ or email us at [email protected].

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex start by reading https://www.tidyverse.org/help/#reprex

Brief description of the problem:

> incidence_cancer("LA", "county", "melanoma of the skin", "all races (includes hispanic)", "females", "all ages", "all stages", "latest 5 year average")
Error in `sym()`:
! Can't convert a character vector to a symbol.

Add a description to the repo

so it populates at https://getwilds.org/packages.html

Update documentation for sex argument

Sex is listed a 'male', 'female' or 'both sexes' in the documentation. Input required is plural - 'males' or 'females'

incidence_cancer(area='NC', areatype="hsa", cancer = "Melanoma of the Skin", race = "all races (includes hispanic)", sex = "male", age = "all ages", stage = "all stages", year = "latest 5 year average")

'Error in handle_sex(sex) :
Invalid sex input, please check the documentation for valid inputs'

installation issue for R 4.4.0

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex start by reading https://www.tidyverse.org/help/#reprex

Brief description of the problem:

The dependency package libcurl4-openssl-dev is not available for R 4.4.0.

> pak::pak("getwilds/cancerprof")
                                                                            
→ Will install 1 package.
→ The package (0 B) is cached.
+ cancerprof   0.1.0 [bld][cmp] (GitHub: 23dbd98)
✖ Missing 1 system package. You'll probably need to install it manually:
+ libcurl4-openssl-dev  - curl
ℹ No downloads are needed, 1 pkg is cached
ℹ Building cancerprof 0.1.0
✖ Failed to build cancerprof 0.1.0 (11.2s)                            
Error:                                                                
! error in pak subprocess
Caused by error in `stop_task_build(state, worker)`:
! Failed to build source package cancerprof.
Type .Last.error to see the more details.

> install.packages("libcurl4-openssl-dev")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.4’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘libcurl4-openssl-dev’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
>

examples and CRAN

On CRAN submission and their daily checks if examples fail you'll hear about it and have to fix it typically within 2 weeks. Just a note to think about WHEN examples are run

There's a number of solutions.

\dontrun as you use in some of your egs
@examplesIf https://roxygen2.r-lib.org/articles/rd.html#examples, this could be:
- @examplesIf interactive() - i think this would avoid cran running them, but not totally sure
- @examplesIf check_if_website_is_up()

And it's not just whether they fail but if how long they take. I don't know exact numbers, but CRAN will complain if they take too long

Coerce column classes?

Perhaps this has already been litigated, but wouldn't it make sense to coerce columns of data.frame outputs to their proper types? e.g.,

x <- risk_women_health(women_health = "mammogram in past 2 years, ages 50-74",
                     race = "all races (includes hispanic)",
                     datatype = "direct estimates")
str(x)
#> 'data.frame':	52 obs. of  6 variables:
#>  $ State                : chr  "Wyoming" "Idaho" "Alaska" "Oklahoma" ...
#>  $ FIPS                 : chr  "56000" "16000" "02900" "40000" ...
#>  $ Percent              : chr  "65.2 " "68.5 " "68.8 " "69.3 " ...
#>  $ Lower_95%_CI         : chr  "61.8" "65.6" "65.0" "66.3" ...
#>  $ Upper_95%_CI         : chr  " 68.5" " 71.4" " 72.5" " 72.3" ...
#>  $ Number_of_Respondents: chr  "706" "933" "873" "882" ...

where I'd think the last 4 columns at least could safely be converted to numeric/integer

I don't see this as super important, but would make it easier for users once they have the data so they don't have to do this themselves

Reduce duplication

e..g, the 4 process_* functions. they share a lot of the same code, and that common code could be factored out. but I haven't looked at the 4 fxns in detail so it's possible it's not worth it

Selecting a state/territory abbreviation for area and areatype as 'state' pulls both DC and Puerto Rico information

When putting in a state/territory abbreviation code and selecting areatype='state' data from District of Columbia and Puerto Rico are returned.

Example:
incidence_cancer(area="NC", areatype="state", cancer = "Melanoma of the Skin", race = "all races (includes hispanic)", sex = "females", age = "all ages", stage = "all stages", year = "latest 5 year average")

I would expect this to throw an error instead of returning DC and Puerto Rico information. Unless features are upgraded to allow filtering of USA list- i.e. area = c('WA', 'NC', 'PR'), it would be helpful to specify in documentation that when specifying a state/territory in area, that areatype must be 'hsa' or 'county'

Use vcr

I'm not super familiar with this package, so I'd make sure to check whether vcr is appropriate first, and if so use it where it makes sense, and not use it, or use other testing approaches where it doesn't make sense

Historical Trends Data

Do we have a function to extract "Historical Trends" data? For ex, https://statecancerprofiles.cancer.gov/historicaltrend/index.php?0&9953&999&7599&001&001&28&0&0&0&1&1&1&1#results

The URL to "Export Data" on the "Historical Trends" page is not pretty, to say the least: https://statecancerprofiles.cancer.gov/historicaltrend/data.php/historicaltrend.csv?0&9953&999&7599&001&001&00&0&0&0&2&0&1&1&6.

If we don't have a function already, I'm not sure how to use httr2 to grab this data. I may have to scrape this data using rvest.

Request for default values for standard demographics

For race, sex, age, setting default values might make this package quicker and easier to use for those new to the data ('All Races (includes Hispanic)', 'both sexes', 'all ages'). This would be counter to the OS of the website as it requires individual selection for all these values but might make the package more user-friendly.

Depending on other users' requests/preferences could also consider defaulting for cancer options ('all cancer sites') and stage ('all stages').

add a package level man file

https://usethis.r-lib.org/reference/use_package_doc.html

good idea to have this manual file so users can do ?cancerprof and get a high level overview of the package

Needs linting

See: https://lintr.r-lib.org/

Clean up the package via iterative lints: lintr::lint_package()
Set up linting via gh actions: usethis::use_github_action("lint")

Also @sckott may have tips. (I have only gotten serious about lintr since Scott joined DaSL)

Organize functions into fewer files

Currently, this package's R/ directory contains one file for each function. So, the file demo-crowding.R contains demo_crowding(), the file demo-education.R contains demo_education(), and so on.

According to the "R code" chapter in R Packages, this extreme is bad. I suggest we have a single .R file contain a family of related functions. For ex, demo.R would contain all the demo_*() functions, handle.R would contain all the handle_*() functions, and so on.

output_metadata_list recent trend is getting deleted after adding a new class.

@seankross
could you take a look at get-metadata.R lines 86 - 137?
https://github.com/getwilds/cancerprof/blob/646a242744f5422fc36283be61f01ef2db8faddb/R/get-metadata.R#L86C1-L137

I am trying to add recent_trends as a dataframe inside a list of strings. On line 119, output_metadata_list prints out correctly, but once I add the new class on line 135, recent_trends disappears from output_metadata_list.

Metadata Parsing proposal

the metadata for each of the 4 categories in state cancer profiles is different so I think having different ways to handle the parse of the metadata would be needed. Using a list of strings and having a key and value might be the most useful for users to have control of the metadata.

demo_metadata_list <- list(
  input = c(
    "Demographic Data Report for Washington by County",
    "Education: Less than 9th grade",
    "Population Ages 25+",
    "All Races (includes Hispanic), Both Sexes",
    "2017-2021 American Community Survey 5-Year Data"
  ),
  sortedby = "Value",
  createdby = "statecancerprofiles.cancer.gov on 01/10/2024 6:16 pm.",
  data_sources = "Demographic data provided by the Census Bureau (http://www.census.gov/) & the American Community Survey (http://www.census.gov/acs/www/).",
  data_dictionary = "For more information about Education: Less than 9th grade, see the dictionary at http://statecancerprofiles.cancer.gov/dictionary.php#education.",
  data_limitations = "Data for United States does not include Puerto Rico."
)

Incidence and Mortality metadata is much more complex.

incd_metadata_list <- list(
  input = c("Incidence Rate Report for Washington by County",
            "Bladder (All Stages^), 2016-2020",
            "All Races (includes Hispanic), Both Sexes, Ages <50")
  sortedby = "Rate",
  createdBy = "Created by statecancerprofiles.cancer.gov on 02/23/2024 5:25 pm."
  trend = c(
    "Rising when 95% confidence interval of average annual percent change is above 0.",
    "Stable when 95% confidence interval of average annual percent change includes 0.",
    "Falling when 95% confidence interval of average annual percent change is below 0."
  ),
  rate_note = c(
    "Incidence rates (cases per 100,000 population per year) are age-adjusted to the 2000 US standard population.",
    "Rates are for invasive cancer only (except for bladder cancer which is invasive and in situ) or unless otherwise specified."
  ),
  trend_note = c(
    "Incidence data come from different sources.",
    "AAPCs are calculated by the Joinpoint Regression Program."
  ),
  data_sources = c(
    "National Program of Cancer Registries [https://www.cdc.gov/cancer/npcr/index.htm]",
    "Surveillance, Epidemiology, and End Results [http://seer.cancer.gov]",
    "SEER*Stat Database - United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Based on the 2022 submission.",
    "SEER November 2022 submission."
  ),
  data_dictionary = c(
    "Incidence rates (cases per 100,000 population per year) are age-adjusted to the 2000 US standard population.",
    "AAPCs are calculated by the Joinpoint Regression Program and are based on APCs."
  ),
  data_limitations = c(
    "Data has been suppressed to ensure confidentiality and stability of rate estimates.",
    "Data for the United States does not include data from Nevada.",
    "Data for the United States does not include Puerto Rico."
  )
)

Order of parameters

Typically - at least what I'm familiar with - required parameters go first, with optional parameters (including those with default values) last.

e.g., in demo_crowding you have

function(area, areatype, 
                          crowding = "household with >1 person per room", 
                          race)

whereas i'd want crowding to go last

function(area, areatype, race,
                          crowding = "household with >1 person per room")

do other folks have other ideas?

vignettes and CRAN

On CRAN submission and their daily checks if vignettes fail to build you'll hear about it and have to fix it typically within 2 weeks.

There's probably a number of different solutions, but from my perspective you shouldn't submit a package to CRAN with vignettes that have to build. That is, prebuild them.

A solution I use with all my pkgs is

a make command or similar like https://github.com/getwilds/rcromwell/blob/main/Makefile#L33-L37 that builds the .Rmd file from another file, in my case .Rmd.og (og is for "original").
then your .Rmd.og file like https://github.com/getwilds/rcromwell/blob/main/vignettes/rcromwell.Rmd.og becomes the .Rmd file like https://github.com/getwilds/rcromwell/blob/main/vignettes/rcromwell.Rmd - the code blocks are already rendered, so when CRAN builds/checks your pkg the code in vignettes isn't actually run

A small caveat with this is that your vignettes are not run on cran checks - and not built on normal R CMD BUILD, etc. But the alternative is that vignette builds can fail on CRAN and even if don't fail can take a long time, which they may also contact you about

`create_request()` has no return value

Currently, this is what create_request() looks like:

create_request <- function(topic) {
  url = "https://statecancerprofiles.cancer.gov/"
  url_end = "/index.php"
  
  url = paste0(url, topic, url_end)
  
  req <- request(url)
}

I was confused by this syntax as I'm used to isolating the value I want to return on the last line of a function. This SO post revealed that an R function returns the value of the last expression evaluated, which in our case would be request(url). However, since it is assigned to req, this sets the visibility to FALSE, so it doesn't display.

Unless you had specific reasons, I would get rid of the assignment and simply have request(url) as the last line of the function. This way, in the future, when we are debugging an error and running this function, we'll see the request object.

For ex,

> create_request("risk")
<httr2_request>
GET https://statecancerprofiles.cancer.gov/risk/index.php
Body: empty

Response URLs

Having read through the source code for demo_crowding() and demo_education() in the dev branch, I noticed that we assign a variable for resp$url:

cancerprof/R/demo-crowding.R

Line 72 in c51290c

resp_url <- resp$url

cancerprof/R/demo-education.R

Line 92 in c51290c

resp_url <- resp$url

I don't think this is necessary since the response URL shouldn't be different from the request URL (unless there were redirects, which we can safely assume there won't be).

We should be getting the same url as resp$url from req$url, so I don't think we need to assign a new variable, resp_url.

What do you think?

Questions about `process_response()`

Questions that arose from my genuine curiosity:

On line 33, I was wondering why you used textConnection() and read.csv() to get data frame from resp_lines. I thought that performing some string gymnastics could have reformatted resp_lines into a data frame, but curious as to the logic behind this line.
On lines 39 and 43. I was wondering why you used symbols instead of just filtering the column as-is.

Thanks!

Documentation notes

strongly suggest reusing documentation whenever possible to reduce duplicated docs, see https://roxygen2.r-lib.org/articles/reuse.html
might have more ... placeholder

Inconsistency in processing metadata

Today, I went through the source code for the demo_*() family of functions, and noticed an inconsistency. demo_crowding() and demo_education() closes with a line of code to process the metadata with process_metadata():

cancerprof/R/demo-crowding.R

Line 85 in beffd3c

process_metadata(resp, "demographics", resp_url)

The other demographics functions do not process the metadata at the end. Is there a reason why they don't?

Same point holds for the risk_*() family of functions.

Mysterious file

What is this file used for?
.Rapp.history: https://github.com/getwilds/cancerprof/blob/main/.Rapp.history

Error in `as_tibble()`

While reviewing #91, I ran into an error when running process_resp():

Error in as_tibble(.) : could not find function "as_tibble"

I think you just need to add @importFrom tibble as_tibble in the roxygen documentation for process_resp().

pkgdown site?

If there will be one https://pkgdown.r-lib.org/ ...

consider using @family to group related fxns so users can find them easier, e.g., all the handle_ fxns
...

getwilds / cancerprof Goto Github PK

cancerprof's People

Contributors

Stargazers

Forkers

cancerprof's Issues

Recommend Projects

Recommend Topics

Recommend Org